DANet: Dual Attention Network for Scene Segmentation
{Self-attention, Spatial attention, Channel attention, Local + Global Features}
0) Motivation, Objectives and Related Works:
Motivation:
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism, unlike previous works that capture context by multi-scale feature fusion.
Objectives:
Dual Attention Networks (DANet) - Adaptively integrate local features with their global dependencies.
Two types of attention modules:
The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances.
The channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps.
We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results.
We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset.
In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data.
Related Works:
The mainstream scene segmentation methods can be roughly divided into two types. One enhances feature representations by multi-scale feature fusion, such as spatial pyramid structures (PSP, ASPP) or the fusion of high-level and shallow features (RefineNet). However, these methods do not take the correlation between different features into account, which is really important for understanding the scene. The other uses RNNs to build associations between features over long ranges, but this association is often limited by the long-term memorization capability of RNNs.
Overview:
As illustrated in Figure 2, we design two types of attention modules to draw global context over local features generated by a dilated residual network, thus obtaining better feature representations for pixel-level prediction. We employ a pretrained residual network with the dilated strategy [3] as the backbone. Note that we remove the downsampling operations and employ dilated convolutions in the last two ResNet blocks, thus enlarging the size of the final feature map to 1/8 of the input image. This retains more details without adding extra parameters. The features from the dilated residual network are then fed into two parallel attention modules.
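A minimal sketch of this dilated backbone is shown below (assuming a torchvision ResNet-101; the replace_stride_with_dilation flag converts the strides of the last two blocks into dilation, which is one common way to obtain the 1/8 output stride described above):

```python
import torch
from torchvision.models import resnet101

# Dilated ResNet backbone: strides in the last two blocks (layer3, layer4) are
# replaced with dilated convolutions, so the final feature map is 1/8 of the input.
backbone = resnet101(replace_stride_with_dilation=[False, True, True])  # pass pretrained weights in practice

x = torch.randn(1, 3, 224, 224)
feat = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
for layer in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
    feat = layer(feat)
print(feat.shape)  # torch.Size([1, 2048, 28, 28]) -> 224 / 8 = 28
```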
The position attention module: we first apply a convolution layer to reduce the feature dimension. Then we feed the features into the position attention module and generate new features carrying spatial long-range contextual information through the following three steps. The first step is to generate a spatial attention matrix which models the spatial relationship between any two pixels of the features. Next, we perform a matrix multiplication between the attention matrix and the original features. Third, we perform an element-wise sum between the resulting matrix and the original features to obtain the final representations reflecting long-range contexts.
The channel attention module: long-range contextual information in the channel dimension is captured by a channel attention module. The process of capturing the channel relationship is similar to the position attention module, except for the first step, in which the channel attention matrix is calculated in the channel dimension. Finally, we aggregate the outputs from the two attention modules to obtain better feature representations for pixel-level prediction.
1) Spatial attention mechanism:
We introduce the self-attention mechanism to capture the spatial dependencies between any two positions of the feature maps.
For the feature at a certain position, it is updated via aggregating features at all positions with weighted summation, where the weights are decided by the feature similarities between the corresponding two positions.
That is, any two positions with similar features can contribute mutual improvement regardless of their distance in spatial dimension.
Motivation:
Discriminant feature representations are essential for scene understanding, which could be obtained by capturing long-range contextual information.
However, many works [15, 30] suggest that local features generated by traditional FCNs could lead to misclassification of objects and stuff.
Objectives:
Introduce a position attention module in order to model rich contextual relationships over local features.
The position attention module encodes a wider range of contextual information into local features, thus enhancing their representation capability. Next, we elaborate the process to adaptively aggregate spatial contexts.
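For reference, the attention weights and the aggregation referred to below as Equation 2 are, as given in the paper (B, C, D are the features produced from A by the convolution layers, N = H×W, and α is a learnable scale initialized to 0):

```latex
% Eq. 1: spatial attention weight, measuring the impact of position i on position j
s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}

% Eq. 2: output feature at position j, a weighted sum over all positions plus A_j
E_j = \alpha \sum_{i=1}^{N} \left( s_{ji} D_i \right) + A_j
```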
It can be inferred from Equation 2 that the resulting feature E at each position is a weighted sum of the features across all positions and original features.
Therefore, it has a global contextual view and selectively aggregates contexts according to the spatial attention map. Similar semantic features achieve mutual gains, thus improving intra-class compactness and semantic consistency.
Assume the feature map A is 3×2×2 (C×H×W). The features B, C and D obtained after convolution become 3×4 (C×N) after reshaping, where N = H×W = 4.
From the result, each row of the product of D and the attention map is a weighted recombination of the positions within one channel: element (1,1) is the value at the first position of the first channel, rewritten as a weighted sum over all positions of that channel using the spatial weights between the first position and every other position; (1,2) is the second position of the first channel aggregated in the same way; (2,1) is the first position of the second channel, and so on.
After being reshaped back into a 3×2×2 (C×H×W) tensor, scaled by the learnable factor α, and added to the original A, the spatial attention feature map is obtained.
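A minimal PyTorch sketch of the position attention module, following the steps above (the class name PositionAttention and the channel-reduction factor of 8 for the query/key convolutions are common implementation choices, not fixed by these notes):

```python
import torch
import torch.nn as nn


class PositionAttention(nn.Module):
    """Position attention: each position is updated by a weighted sum over all
    positions, with weights given by pairwise feature similarity (Eqs. 1-2)."""

    def __init__(self, in_channels: int):
        super().__init__()
        reduced = max(in_channels // 8, 1)  # channel reduction for query/key
        self.query_conv = nn.Conv2d(in_channels, reduced, kernel_size=1)      # B
        self.key_conv = nn.Conv2d(in_channels, reduced, kernel_size=1)        # C
        self.value_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)  # D
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        n, c, h, w = a.shape
        num_pos = h * w  # N = H * W
        # B: (n, N, C'), C: (n, C', N) -> attention map S: (n, N, N)
        b = self.query_conv(a).view(n, -1, num_pos).permute(0, 2, 1)
        c_feat = self.key_conv(a).view(n, -1, num_pos)
        s = self.softmax(torch.bmm(b, c_feat))
        # D: (n, C, N); aggregate all positions using the spatial weights in S
        d = self.value_conv(a).view(n, -1, num_pos)
        e = torch.bmm(d, s.permute(0, 2, 1)).view(n, c, h, w)
        return self.alpha * e + a  # element-wise sum with the original features


# Toy run matching the 3x2x2 example above (batch size 1).
pam = PositionAttention(in_channels=3)
out = pam(torch.randn(1, 3, 2, 2))
print(out.shape)  # torch.Size([1, 3, 2, 2])
```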
2) Channel attention mechanism:
We use a similar self-attention mechanism to capture the channel dependencies between any two channel maps, and update each channel map with a weighted sum of all channel maps.
Motivation:
Each channel map of high-level features can be regarded as a class-specific response, and different semantic responses are associated with each other. By exploiting the interdependencies between channel maps, we could emphasize interdependent feature maps and improve the feature representation of specific semantics.
Objectives:
We build a channel attention module to explicitly model interdependencies between channels.
Equation 4 shows that the final feature of each channel is a weighted sum of the features of all channels and the original features, which models the long-range semantic dependencies between feature maps.
It helps to boost feature discriminability.
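For reference, Equations 3 and 4 from the paper (the channel similarities are computed directly on the reshaped A, and β is a learnable scale initialized to 0):

```latex
% Eq. 3: channel attention weight, measuring the impact of channel i on channel j
x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}

% Eq. 4: output feature for channel j, a weighted sum over all channels plus A_j
E_j = \beta \sum_{i=1}^{C} \left( x_{ji} A_i \right) + A_j
```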
Note that we do not employ convolution layers to embed features before computing the relationships between two channels, since doing so maintains the relationships between different channel maps. In addition, different from recent works [28] which explore channel relationships by a global pooling or encoding layer, we exploit spatial information at all corresponding positions to model channel correlations.
The channel attention is computed similarly. After reshaping, transposing, matrix multiplication and a softmax, the channel attention map X (C×C) is obtained.
We then perform a matrix multiplication between X and M (M is A reshaped to C×N), giving a C×N result.
As shown in the figure, each position of the result is a weighted combination of the three channels. The result is reshaped back into C×H×W, scaled by the learnable factor β, and added to the original A to obtain the channel attention feature map.
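A minimal PyTorch sketch of the channel attention module (the class name ChannelAttention is illustrative; as noted above, no embedding convolutions are used before computing channel similarities):

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: each channel map is updated by a weighted sum of all
    channel maps, with weights given by channel-wise similarity (Eqs. 3-4)."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learnable scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        n, c, h, w = a.shape
        m = a.view(n, c, -1)  # M: (n, C, N), A reshaped without any embedding conv
        # X: (n, C, C), channel attention map from similarities between channel maps
        x = self.softmax(torch.bmm(m, m.permute(0, 2, 1)))
        # Aggregate all channels with the weights in X, then reshape back to C x H x W
        e = torch.bmm(x, m).view(n, c, h, w)
        return self.beta * e + a  # element-wise sum with the original features


cam = ChannelAttention()
out = cam(torch.randn(1, 3, 2, 2))  # same toy 3x2x2 feature map A
print(out.shape)                    # torch.Size([1, 3, 2, 2])
```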
3) Attention Module Embedding with Networks:
In order to take full advantage of long-range contextual information, we aggregate the features from these two attention modules.
Specifically, we transform the outputs of two attention modules by a convolution layer and perform an element-wise sum to accomplish feature fusion.
Finally, a convolution layer is applied to generate the final prediction map. We do not adopt a cascading operation because it requires more GPU memory.
Note that our attention modules are simple and can be directly inserted into the existing FCN pipeline. They add few parameters yet strengthen feature representations effectively.
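A minimal sketch of the fusion step, reusing the PositionAttention and ChannelAttention modules sketched above (the intermediate channel width, the 3x3 convolutions and the class name DualAttentionHead are illustrative assumptions, not specified in these notes):

```python
import torch
import torch.nn as nn


class DualAttentionHead(nn.Module):
    """Transform each attention branch with a convolution, fuse them by
    element-wise sum, then apply a final convolution for per-pixel prediction."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        inter = in_channels // 4  # dimension reduction before the attention modules

        def conv_block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        self.reduce_p = conv_block(in_channels, inter)
        self.reduce_c = conv_block(in_channels, inter)
        self.pam = PositionAttention(inter)
        self.cam = ChannelAttention()
        self.post_p = conv_block(inter, inter)  # transform PAM output before fusion
        self.post_c = conv_block(inter, inter)  # transform CAM output before fusion
        self.classifier = nn.Conv2d(inter, num_classes, kernel_size=1)  # final prediction

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        p = self.post_p(self.pam(self.reduce_p(feat)))
        c = self.post_c(self.cam(self.reduce_c(feat)))
        return self.classifier(p + c)  # element-wise sum fusion


head = DualAttentionHead(in_channels=2048, num_classes=19)  # e.g. 19 Cityscapes classes
logits = head(torch.randn(1, 2048, 28, 28))  # backbone features at 1/8 resolution
print(logits.shape)  # torch.Size([1, 19, 28, 28])
```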
4) Results: