Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks
{Attention, Transformer, Multi-Layer Perceptrons}
0) Motivation, Objectives and Related Works:
Motivation:
Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks.
Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample.
However, self-attention has quadratic complexity and ignores the potential correlation between different samples.
Objectives:
Proposes external attention, based on two external, small, learnable, shared memories, which can be implemented easily using two cascaded linear layers and two normalization layers (a minimal code sketch follows this list);
Conveniently replaces self-attention in existing popular architectures.
External attention has linear complexity and implicitly considers the correlations between all data samples.
We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification.
Our method provides results comparable to or better than those of the self-attention mechanism and some of its variants, at much lower computational and memory cost.
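To make the claim above concrete, here is a minimal PyTorch sketch of a single-head external attention layer, including the double-normalization described in Section 1.2; the layer names and the d_model = 512, S = 64 hyper-parameters are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalAttention(nn.Module):
    """Single-head external attention: two cascaded linear layers acting as
    shared key/value memories, plus double normalization of the attention map."""

    def __init__(self, d_model: int = 512, S: int = 64):
        super().__init__()
        self.mk = nn.Linear(d_model, S, bias=False)  # external key memory M_k (S x d), stored as the layer weight
        self.mv = nn.Linear(S, d_model, bias=False)  # external value memory M_v (S x d), stored as the layer weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model), where N is the number of positions (pixels / tokens)
        attn = self.mk(x)                                      # (batch, N, S), i.e. F M_k^T
        attn = F.softmax(attn, dim=1)                          # softmax along the position axis
        attn = attn / (1e-9 + attn.sum(dim=2, keepdim=True))   # l1 normalization along the memory axis
        return self.mv(attn)                                   # (batch, N, d_model), i.e. A M_v


# usage: same input and output shape, linear cost in N
x = torch.randn(2, 1024, 512)          # e.g. a flattened 32x32 feature map
out = ExternalAttention(512, 64)(x)
```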
Related Works:
The attention mechanism in visual tasks
The attention mechanism can be viewed as a mechanism for reallocating resources according to the importance of activation. It plays an important role in the human visual system. There has been vigorous development of this field in the last decade [3], [13], [14], [15], [16], [17], [18]. Hu et al. proposed SENet [15], showing that the attention mechanism can reduce noise and improve classification performance. Subsequently, many other papers have applied it to visual tasks. Wang et al. presented non-local networks [3] for video understanding, Hu et al. [19] used attention in object detection, Fu et al. proposed DANet [4] for semantic segmentation, Zhang et al. [11] demonstrated the effectiveness of the attention mechanism in image generation, and Xie et al. proposed A-SCN [20] for point cloud processing.
Self-attention in visual tasks
Self-attention is a special case of attention, and many papers [3], [4], [11], [17], [21] have considered the self-attention mechanism for vision. The core idea of self-attention is calculating the affinity between features to capture long-range dependencies. However, as the size of the feature map increases, the computing and memory overheads increase quadratically. To reduce computational and memory costs, Huang et al. [5] proposed criss-cross attention, which considers row attention and column attention in turn to capture the global context. Li et al. [6] adopted expectation maximization (EM) clustering to optimize self-attention. Yuan et al. [7] proposed the use of object-contextual vectors to process attention; however, this depends on semantic labels. Geng et al. [8] showed that matrix decomposition is a better way to model the global context in semantic segmentation and image generation. Other works [22], [23] have also explored extracting local information using the self-attention mechanism. Unlike self-attention, which obtains an attention map by computing affinities between self queries and self keys, our external attention computes the relation between self queries and a much smaller learnable key memory, which captures the global context of the dataset. External attention does not rely on semantic information and can be optimized end-to-end by back-propagation instead of requiring an iterative algorithm.
Transformer in visual tasks
Transformer-based models have had great success in natural language processing [1], [2], [16], [24], [25], [26], [27]. Recently, they have also demonstrated huge potential for visual tasks. Carion et al. [28] presented an end-to-end detection transformer that takes CNN features as input and generates bounding boxes with a transformer. Dosovitskiy et al. [18] proposed ViT, based on patch encoding and a transformer, showing that with sufficient training data, a transformer provides better performance than a traditional CNN. Chen et al. [29] proposed iGPT for image generation based on a transformer. Subsequently, transformer methods have been successfully applied to many visual tasks, including image classification [12], [30], [31], [32], object detection [33], low-level vision [34], semantic segmentation [35], tracking [36], video instance segmentation [37], image generation [38], multimodal learning [39], object re-identification [40], image captioning [41], point cloud learning [42] and self-supervised learning [43]. Readers are referred to recent surveys [44], [45] for a more comprehensive review of the use of transformer methods for visual tasks.
1) Methodology:
1.1 Self-Attention and External Attention:
Self-attention requires first calculating an attention map by computing the affinities between self query vectors and self key vectors, then generating a new feature map by weighting the self value vectors with this attention map.
External attention works differently. We first calculate the attention map by computing the affinities between the self query vectors and an external learnable key memory, and then produce a refined feature map by multiplying this attention map by another external learnable value memory.
Self-Attention:
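In the paper's notation, given an input feature map F ∈ R^{N×d} with N positions and d channels, the standard formulation is:
Q = F W_q, K = F W_k, V = F W_v
A = softmax(Q K^T) ∈ R^{N×N}
F_out = A V
so computation grows as O(dN^2) with the number of positions.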
Simplified Self-Attention:
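The simplified variant drops the query/key/value projections and computes affinities directly on the input feature:
A = softmax(F F^T)
F_out = A F
which is still quadratic in N.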
External Attention:
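External attention replaces the self keys and values with two small external learnable memories M_k, M_v ∈ R^{S×d} shared across the whole dataset:
A = Norm(F M_k^T) ∈ R^{N×S}
F_out = A M_v
Because S is a small constant independent of N, the overall complexity is O(dSN), i.e. linear in the number of positions; Norm is the double-normalization described next (Sec. 1.2).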
1.2 Double-normalization:
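Because A = F M_k^T is a plain matrix product rather than a softmax over scaled dot-products, it is sensitive to the scale of the input features. The attention map is therefore normalized twice, with a softmax along the position dimension followed by an l1 normalization along the memory dimension:
(α̃)_{i,j} = (F M_k^T)_{i,j}
α̂_{i,j} = exp(α̃_{i,j}) / Σ_k exp(α̃_{k,j})   (softmax over positions)
α_{i,j} = α̂_{i,j} / Σ_k α̂_{i,k}   (l1 normalization over memory elements)
This is the Norm(·) used in the external-attention formula above and in the code sketch after the Objectives list.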
1.3 Multi-head external attention:
In Transformer [16], self-attention is computed many times on different input channels, which is called multi-head attention. Multi-head attention can capture different relations between tokens, improving upon the capacity of single-head attention. We use a similar approach for multi-head external attention, as shown in Algorithm 2 and Fig. 2. Multi-head external attention can be written as:
h_i = ExternalAttention(F_i, M_k, M_v)
Z = MultiHead(F, M_k, M_v) = Concat(h_1, ..., h_H) W_o
where h_i is the i-th head, H is the number of heads and W_o is a linear transformation matrix making the dimensions of input and output consistent. M_k ∈ R^{S×d} and M_v ∈ R^{S×d} are the shared memory units for different heads.
The flexibility of this architecture allows us to balance the number of heads H against the number of elements S in the shared memory units. For instance, we can multiply H by k while dividing S by k.
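A minimal sketch of the multi-head version under the same assumptions as the earlier sketch; the head splitting and layer names below are our illustrative choices, not the reference implementation. All heads share the same M_k and M_v, and the concatenated heads are mixed by W_o, following the equations above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadExternalAttention(nn.Module):
    """Multi-head external attention: the input is split into H heads along the
    channel dimension, every head shares the same memories M_k and M_v, and the
    concatenated heads are mixed by a final linear layer W_o."""

    def __init__(self, d_model: int = 512, H: int = 8, S: int = 64):
        super().__init__()
        assert d_model % H == 0
        self.H, self.d_head = H, d_model // H
        self.mk = nn.Linear(self.d_head, S, bias=False)  # shared key memory M_k
        self.mv = nn.Linear(S, self.d_head, bias=False)  # shared value memory M_v
        self.wo = nn.Linear(d_model, d_model)            # output projection W_o

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model) -> (batch, H, N, d_head)
        b, n, _ = x.shape
        h = x.view(b, n, self.H, self.d_head).transpose(1, 2)
        attn = self.mk(h)                                      # (b, H, N, S)
        attn = F.softmax(attn, dim=2)                          # softmax over positions
        attn = attn / (1e-9 + attn.sum(dim=3, keepdim=True))   # l1 norm over memory slots
        h = self.mv(attn)                                      # (b, H, N, d_head)
        h = h.transpose(1, 2).reshape(b, n, self.H * self.d_head)
        return self.wo(h)                                      # concatenate heads, then W_o


# usage: trade H against S, e.g. H=8, S=64 vs. H=16, S=32
x = torch.randn(2, 196, 512)
out = MultiHeadExternalAttention(512, H=8, S=64)(x)
```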
2) Results
References:
https://zhuanlan.zhihu.com/p/370494166