DCANet: Learning Connected Attentions for Convolutional Neural Networks
0) Motivation, Objectives and Related works:
Motivation:
While self-attention mechanisms have shown promising results for many vision tasks, they only consider the current features at a time.
We show that such a manner cannot take full advantage of the attention mechanism.
Objectives:
Present the Deep Connected Attention Network (DCANet), a novel design that boosts attention modules in a CNN model without modifying their internal structure.
We interconnect adjacent attention blocks, making information flow among attention blocks possible.
With DCANet, all attention blocks in a CNN model are trained jointly, which improves the ability of attention learning.
Our DCANet is generic. It is not limited to a specific attention module or base network architecture.
Experimental results on the ImageNet and MS COCO benchmarks show that DCANet consistently outperforms state-of-the-art attention modules with minimal additional computational overhead in all test cases. All code and models are made publicly available.
Related works:
Self-attention mechanisms:
Explores the interdependence within the input features for a better representation.
Used across a large range of tasks, from machine translation [3] in natural language processing to object detection [7] in computer vision.
1) Channel interdependencies: SENet [18], GENet [17] and SGENet [20] leverage self-attention for contextual modeling.
2) Global context information: NLNet [38] and GCNet [7] introduce self-attention to capture long-range dependencies in non-local operations.
3) Consider both channel-wise and spatial attentions: BAM [26] and CBAM [39].
4) Beyond channel and spatial dependencies, SKNet [21] applies self-attention to kernel size selection.
Residual connections:
By introducing a shortcut connection, neural networks are decomposed into biased and centered subnets to accelerate gradient descent.
ResNet [14,15] adds an identity mapping to connect the input and output of each convolutional block, which drastically alleviates the degradation problem [14] and opens up the possibility for deep convolutional neural networks.
DenseNet [19] connects each block to every other block in a feed-forward fashion.
FishNet [32] connects layers in pursuit of propagating gradient from deep layers to shallow layers.
DLA [40] shows that residual connection is a common approach of layer aggregation. By iteratively and hierarchically aggregating layers in a network, DLA is able to reuse feature maps generated by each layer.
Residual connections are still fairly new when it comes to integration with attention mechanisms.
RANet [37] utilizes residual connections in its attention blocks; in [37,26], residual learning is used within attention modules to facilitate gradient flow.
Connected attention:
RA-CNN recurrently generates attention regions based on the current prediction to learn the most discriminative regions; in this way, RA-CNN refines its attention regions from coarse to fine.
In GANet [5], the top attention maps generated by customized background attention blocks are up-sampled and sent to bottom background attention blocks to guide attention learning.
Different from the recurrent and feed-backward methods, our DCA module enhances attention blocks in a feed-forward fashion, which is more computation-friendly and easier to implement.
1) Deep Connected Attention:
By analyzing the inner structure of various attention blocks, we propose a generic connection scheme that is not confined to particular attention blocks.
We merge the previously generated attention features with the currently extracted features through a parameterized addition, which ensures information flow among all attention blocks in a feed-forward manner and prevents the attention information from changing abruptly at each step.
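In symbols (the notation here is ours, not necessarily the paper's): let $G_n$ denote the features extracted by the $n$-th attention block and $\hat{A}_{n-1}$ the attention features produced by the previous block. The merge can then be written as the parameterized addition
$$\tilde{G}_n = \alpha\, G_n + \beta\, \hat{A}_{n-1},$$
where $\alpha$ and $\beta$ are learnable scalars and $\tilde{G}_n$ replaces $G_n$ as the input to the transformation step.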
1.1 Revisiting Self-Attention Blocks
An attention block consists of three components:
Context extraction serves as a simple feature extractor.
Transformation transforms the extracted features.
Fusion merges the attention and the original features.
These components are generic and not confined to a particular attention block.
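For concreteness, the squeeze-and-excitation (SE) block of SENet [18] maps naturally onto this three-component view. The following is a minimal PyTorch sketch of an SE-style block with the components made explicit; the class and variable names are ours, not the authors' code.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Minimal SE-style attention block, split into the three generic components."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Extraction: global average pooling summarizes each channel into a single context value.
        self.extract = nn.AdaptiveAvgPool2d(1)
        # Transformation: a bottleneck MLP turns the context into per-channel attention weights.
        self.transform = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        context = self.extract(x).view(b, c)                   # Extraction
        attention = self.transform(context).view(b, c, 1, 1)   # Transformation
        return x * attention                                   # Fusion: rescale the original features
```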
1.2 Attention Connection
We connect adjacent attention blocks by feeding the attention features of the previous block into the current one.
The previous attention features and the currently extracted features are merged by a parameterized addition before the transformation step, so that attention information flows through the network in a feed-forward manner.
Because the connection only touches the extraction output, it applies to any attention block that follows the three-component structure above.
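Below is a minimal sketch of how such a connection could be wired between two SE-style blocks like the one above. The learnable scalars alpha and beta, the returned context, and all names are illustrative assumptions rather than the authors' implementation; when adjacent stages have different channel counts, a projection of the previous context would also be needed.

```python
import torch
import torch.nn as nn


class ConnectedSEBlock(nn.Module):
    """SE-style block whose extracted context is merged with the previous block's context."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.extract = nn.AdaptiveAvgPool2d(1)
        self.transform = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Learnable weights for the parameterized addition (illustrative choice).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor, prev_context=None):
        b, c, _, _ = x.shape
        context = self.extract(x).view(b, c)                   # Extraction
        if prev_context is not None:
            # Parameterized addition: attention information flows in from the previous block.
            context = self.alpha * context + self.beta * prev_context
        attention = self.transform(context).view(b, c, 1, 1)   # Transformation
        # Return both the refined features (Fusion) and the merged context for the next block.
        return x * attention, context
```

A backbone built from such blocks would simply pass the returned context of each block to the next one, so that all attention blocks are trained jointly as described above.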