Rotate to Attend: Convolutional Triplet Attention Module
0) Motivation, Objectives and Related Works:
Motivation:
Benefiting from the capability of building interdependencies among channels or spatial locations, attention mechanisms have been extensively studied and broadly used in a variety of computer vision tasks recently.
Objectives:
Present triplet attention, a lightweight yet effective attention mechanism:
a novel method for computing attention weights by capturing cross-dimension interaction using a three-branch structure. For an input tensor, triplet attention builds inter-dimensional dependencies through rotation operations followed by residual transformations, and encodes inter-channel and spatial information with negligible computational overhead. Our method is simple as well as efficient and can easily be plugged into classic backbone networks as an add-on module. We demonstrate the effectiveness of our method on various challenging tasks, including image classification on ImageNet-1k and object detection on the MS-COCO and PASCAL VOC datasets.
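To make the three-branch structure concrete, the following is a minimal NumPy sketch under simplifying assumptions (a single unbatched (C, H, W) tensor, a naive "same"-padded convolution, and illustrative random weights; the actual module is built from learned convolutions in a deep-learning framework). Each branch reduces one dimension to two channels via a Z-pool (concatenated max- and mean-pooling), applies a small convolution and a sigmoid to obtain an attention map over the remaining two dimensions, and gates the tensor; the three branch outputs are averaged:

```python
import numpy as np

def z_pool(x):
    # Concatenate max- and mean-pooling along the first axis -> (2, d1, d2)
    return np.stack([x.max(axis=0), x.mean(axis=0)])

def conv2d_same(x, w):
    # Naive "same"-padded 2D convolution: 2-channel input, 1 output channel.
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros(x.shape[1:])
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def branch(x, w):
    # Attention gate: Z-pool over the first axis, conv, sigmoid, rescale.
    a = sigmoid(conv2d_same(z_pool(x), w))  # attention map over (d1, d2)
    return x * a                            # broadcasts over the first axis

def triplet_attention(x, w1, w2, w3):
    # x: (C, H, W). Rotations are realized here as axis transposes.
    b1 = branch(x.transpose(1, 0, 2), w1).transpose(1, 0, 2)  # C-W interaction
    b2 = branch(x.transpose(2, 1, 0), w2).transpose(2, 1, 0)  # C-H interaction
    b3 = branch(x, w3)                                        # spatial H-W branch
    return (b1 + b2 + b3) / 3.0
```

Note that the spatial branch (b3) is the only one a conventional spatial-attention module would compute; the two transposed branches are what add the cross-dimension (channel-spatial) interaction.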
Furthermore, we provide extensive insight into the performance of triplet attention by visually inspecting GradCAM and GradCAM++ results. The empirical evaluation of our method supports our intuition about the importance of capturing cross-dimension dependencies when computing attention weights.
Related Works:
Attention in human perception relates to the process of selectively concentrating on parts of the given information while ignoring the rest. This mechanism helps in refining perceived information while retaining its context. Over the last few years, several methods have been proposed to efficiently incorporate this attention mechanism into deep convolutional neural network (CNN) architectures to improve performance on large-scale vision tasks. In the remainder of this section, we review attention mechanisms that are strongly related to this work.
Residual Attention Network [32] proposes a trunk-and-mask encoder-decoder style module to generate robust three-dimensional attention maps. Due to the direct generation of 3D attention maps, the method is quite computationally expensive compared to more recently proposed attention methods. This was followed by the introduction of Squeeze-and-Excitation Networks (SENet) [14], widely regarded as the first to implement an efficient way of computing channel attention while providing significant performance improvements. The aim of SENet is to model cross-channel relationships in feature maps by learning per-channel modulation weights. Succeeding SENet, the Convolutional Block Attention Module (CBAM) [34] was proposed, which enriches the attention maps by adding max-pooled features to the channel attention along with an additional spatial attention component. This combination of spatial and channel attention demonstrated substantial performance improvements over SENet. More recently, Double Attention Networks (A2-Nets) [6] introduced a novel relation function for Non-Local (NL) blocks. NL blocks [33] were introduced to capture long-range dependencies via non-local operations and were designed to be lightweight and easy to use in any architecture. Global Second-order Pooling Networks (GSoP-Net) [10] use second-order pooling for richer feature aggregation: the key idea is to gather important features from the entire input space using second-order pooling and subsequently distribute them so that later layers can more easily recognize and propagate them. Global Context Networks (GC-Net) [2] propose a novel NL block integrated with an SE block, aiming to combine contextual representations with channel weighting more efficiently.
Instead of simple downsampling by global average pooling (GAP) as in SENet [14], GC-Net uses a set of complex permutation-based operations to reduce the feature maps before passing them to the SE block.
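For reference, the SE block mentioned above can be sketched in a few lines of NumPy (assumptions of this sketch: a single unbatched (C, H, W) map, plain matrices standing in for the learned fully-connected layers, and no batch normalization):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    w1: (C // r, C) reduction FC weights; w2: (C, C // r) expansion FC
    weights, where r is the channel dimensionality-reduction ratio.
    """
    s = x.mean(axis=(1, 2))                    # squeeze: GAP over spatial dims -> (C,)
    e = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))  # excitation: FC -> ReLU -> FC -> sigmoid
    return x * e[:, None, None]                # recalibrate each channel
```

The r-fold bottleneck in w1/w2 is exactly the kind of dimensionality reduction that the critique at the end of this section refers to.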
Attention mechanisms have also been successfully used for image segmentation and fine-grained image classification. Criss-Cross Networks (CCNet) [15] and SPNet [13] present novel attention blocks that capture rich contextual information using intersecting strips. Xiao et al. [36] propose a pipeline integrating one bottom-up and two top-down attention mechanisms for fine-grained image classification. Cao et al. [1] introduce the 'Look and Think Twice' mechanism, a computational feedback process inspired by the human visual cortex that helps capture visual attention on target objects even against distorted backgrounds.
Most of the above methods have significant shortcomings that we address in our method: none of them accounts for cross-dimension interaction, and several rely on some form of dimensionality reduction, which is unnecessary for capturing cross-channel interaction. By capturing cross-dimension interaction, our triplet attention module provides significant performance gains at a negligible computational overhead compared to the methods described above.
1) Proposed Method:
In this section, we first revisit CBAM [34] and analytically diagnose the efficiency of the shared MLP structure within its channel attention module. Subsequently, we propose our triplet attention module, demonstrate the importance of cross-dimension dependencies, and compare the complexity of our method with other standard attention mechanisms. Finally, we conclude by showcasing how to integrate triplet attention into standard deep CNN architectures for different challenging computer vision tasks.
1.1 Revisiting Channel Attention in CBAM:
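The channel attention being revisited here pools the input feature map with both global average and max pooling, passes both descriptors through a shared two-layer MLP with reduction ratio r, and gates the channels with the fused result. A minimal NumPy sketch (assumptions: a single unbatched (C, H, W) map and plain matrices standing in for the learned shared-MLP layers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam_channel_attention(x, w1, w2):
    """CBAM channel attention on a (C, H, W) map with a shared two-layer MLP.

    w1: (C // r, C) and w2: (C, C // r), where r is the reduction ratio that
    bottlenecks (reduces, then restores) the channel descriptor.
    """
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # shared MLP: FC -> ReLU -> FC
    avg_desc = x.mean(axis=(1, 2))                # global average-pooled descriptor
    max_desc = x.max(axis=(1, 2))                 # global max-pooled descriptor
    a = sigmoid(mlp(avg_desc) + mlp(max_desc))    # fuse descriptors, gate -> (C,)
    return x * a[:, None, None]
```

The shared-MLP bottleneck (w1 reducing C to C // r before w2 restores it) is the structure whose efficiency is diagnosed in this subsection.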