AttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing
{Strip Attention Module, Attention Fusion Module}
Keywords: Real-time Semantic Segmentation; Attention; Global-Context; Multi-level semantics.
0) Motivation, Objectives and Related Works:
Motivation:
Generating features that capture global context and multi-level semantics leads to high computational complexity, which is problematic in real-time scenarios.
Global scene clues: PSPNet (Zhao et al. 2017), DANet (Fu et al. 2019a), and AlignSeg (Huang et al. 2020).
Multi-level representations rely on both semantic information in high-level features and spatial details in low-level features.
Objectives:
Attention-Augmented Network (AttaNet) captures both global context and multi-level semantics while keeping efficiency high. Two primary modules:
Strip Attention Module (SAM): capture long-range dependencies with only slightly increased computational cost
Attention Fusion Module (AFM): weight the importance of multi-level features during fusion, which attains a multi-level representation effectively and efficiently.
Evaluated on two semantic segmentation benchmarks: Cityscapes and ADE20K.
AttaNet is a convolutional network that uses a cross-level aggregation architecture. Without loss of generality, pre-trained ResNet is chosen as the backbone by removing the last fully-connected layer, and other CNNs can also be chosen as the backbone.
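A minimal sketch of what that backbone setup could look like in PyTorch (ResNet-18 via torchvision is only an illustrative choice here, not the paper's prescribed backbone):

```python
import torch.nn as nn
from torchvision.models import resnet18

# Pre-trained ResNet with the classification head removed, keeping only the
# convolutional stages as the segmentation backbone.
backbone = resnet18(pretrained=True)                        # example backbone choice
backbone = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool + fc
```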
Related works:
The self-attention Model:
Capture long-range dependencies
Long short-term memory-networks for machine reading_arXiv_2016
Attention is all you need_NIPs_2017
Capture non-local contextual information with limited computation
Non-local neural networks_CVPR_2018
Model spatial and channel relationships
A relation-augmented fully convolutional network for semantic segmentation in aerial scenes_CVPR_2019
Use the self-attention mechanism to capture long-range dependencies from all pixels
OCNet: Object Context Network for Scene Parsing_arXiv_2018
Dual attention network for scene segmentation_CVPR_2019
=> Problem: these methods generate huge attention maps, which are computationally expensive.
Leverages two criss-cross attention modules
CCNet: Criss-cross attention for semantic segmentation_ICCV_2019
Directly exploits class-level context
ACFNet: Attentional class feature network for semantic segmentation_ICCV_2019
=> Propose: Strip Attention Module
Capture long-range relations more effectively and efficiently.
Strengthen the contextual consistency in a specific direction while reducing the size of the affinity map (for an H × W feature map, full self-attention needs an (HW) × (HW) affinity map, whereas attending from each pixel to W column stripes reduces it to (HW) × W).
Multi-level Feature Fusion:
Combine multi-level representations:
Fully convolutional networks for semantic segmentation_CVPR_2015
U-Net: Convolutional networks for biomedical image segmentation_MICCAI_2015
Refinenet: Multi-path refinement networks for high-resolution semantic segmentation_CVPR_2017
Segnet: A deep convolutional encoder-decoder architecture for image segmentation_PAMI_2017
Adaptive context network for scene parsing_ICCV_2019
Adopt the multi-branch framework (an extra branch)
ICNet for real-time semantic segmentation on high-resolution images_ECCV_2018
BiSeNet: Bilateral segmentation network for real-time semantic segmentation_ECCV_2018
BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation_2020
Implement a cross-level feature aggregation architecture with less computation.
DFANet: Deep feature aggregation for real-time semantic segmentation_CVPR_2019
Semantic Flow for Fast and Accurate Scene Parsing_arXiv_2020
=> Problem: These methods ignore the representation gap among multi-level features, which limits the effectiveness of information propagation.
Uses gates to control information propagation:
GFF: Gated Fully Fusion for Semantic Segmentation_arXiv_2019.
=> Propose: Attention Fusion Module - adopts a lightweight attention strategy to bridge the gap among multi-level features with high adaptability and efficiency.
Real-time Segmentation: Generate high-quality predictions while keeping high inference speed.
Adopt shallow layers on the high-resolution image to speed up
ICNet proposes an image cascade network using multi-resolution images as input to raise efficiency.
BiSeNetV2 introduces a detail branch and a semantic branch to reduce calculation.
Adopt a lightweight backbone to speed up the inference
DFANet and LiteSeg.
=> Propose: Work with large backbone networks while reducing computational complexity and preserving both semantic and spatial information.
1) Method:
1.1) Network Architecture:
Strip Attention Module (SAM): capture long-range relations efficiently.
In SAM, add a Striping layer before the Affinity operation to capture the strongest consistency along anisotropic or banded context.
Then, utilize the Affinity operation to capture long-range relations in the horizontal direction and further enhance the consistency.
Attention Fusion Module (AFM): efficient feature aggregation
In AFM, use an attention strategy to make the model focus on the most relevant features as needed, which bridges the representation gap between multi-level features and enables effective information propagation.
Feature extraction: ResNet
Feature refinement: Deep supervision.
Loss function: L is the joint loss.
Use the principal loss function lp to supervise the output of the whole network.
Add two auxiliary loss functions li to supervise the outputs of the res3 block and the AFM.
All loss functions are cross-entropy losses.
A parameter λ balances the principal loss and the auxiliary losses; K and λ are set to 2 and 1 respectively in the implementation.
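Written out from the description above, the joint loss is:
L = lp + λ · Σ_{i=1..K} li = lp + λ · (l1 + l2), with K = 2 and λ = 1, where every term is a cross-entropy loss.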
1.2) Strip Attention Module:
Purpose:
Capture non-local contextual relations and also reduce the computational complexity in time and space
Method:
Add a Striping layer before the Affinity operation to capture the strongest consistency along anisotropic or banded context.
Utilize the Affinity operation to capture long-range relations in the horizontal direction and further enhance the consistency.
Benefit: The benefits of our SAM are three-fold.
First, since the striped feature map is the combination of all pixels along the same spatial dimension, it gives strong supervision in capturing anisotropic or banded context.
Second, the relationships between each pixel and all columns are considered before the attention map is estimated along the horizontal axis, so the network can still generate dense contextual dependencies.
Moreover, this module adds only a few parameters to the backbone network, and therefore takes up very little GPU memory.
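A minimal PyTorch sketch of a strip-attention block along these lines (the 1 × 1 convolutions, the reduced key width, vertical average pooling as the Striping layer, and the residual connection are illustrative assumptions, not details taken from the paper):

```python
import torch
import torch.nn as nn

class StripAttention(nn.Module):
    """Sketch of strip attention: every pixel attends to W column stripes."""
    def __init__(self, in_ch, key_ch=None):
        super().__init__()
        key_ch = key_ch or in_ch // 8
        self.query = nn.Conv2d(in_ch, key_ch, 1)
        self.key = nn.Conv2d(in_ch, key_ch, 1)
        self.value = nn.Conv2d(in_ch, in_ch, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.query(x)                                # N x C' x H x W
        # Striping (assumed): average-pool key/value over the vertical axis,
        # giving one descriptor per column stripe
        k = self.key(x).mean(dim=2)                      # N x C' x W
        v = self.value(x).mean(dim=2)                    # N x C  x W
        # Affinity between every pixel and every column (horizontal direction)
        q = q.flatten(2).permute(0, 2, 1)                # N x (H*W) x C'
        attn = torch.softmax(torch.bmm(q, k), dim=-1)    # N x (H*W) x W
        out = torch.bmm(attn, v.permute(0, 2, 1))        # N x (H*W) x C
        out = out.permute(0, 2, 1).reshape(n, c, h, w)
        return out + x                                   # residual connection (assumed)
```

Note that the affinity map here is (H·W) × W rather than (H·W) × (H·W), which is where the memory saving over full self-attention comes from.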
1.3) Attention Fusion Module:
Purpose:
The most common aggregation approaches (e.g., Yu et al. 2018c; Chen et al. 2018) first upsample F_l via bilinear interpolation and then concatenate or add the upsampled F_l and F_{l−1} together.
However, low-level features contain excessive spatial details while high-level features are rich in semantics. Simply aggregating multi-level information would weaken the effectiveness of information propagation.
Attention Fusion Module which enables each pixel to choose individual contextual information from multi-level features in the aggregation phase.
Method:
Given two adjacent feature maps F_l and F_{l−1}:
Upsample F_l to the same size as F_{l−1} by standard bilinear interpolation.
Feed F_{l−1} into a 3 × 3 convolutional layer (with BN and ReLU).
Concatenate the upsampled F_l with the local feature F_{l−1}, and feed the concatenated features to a 1 × 1 convolutional layer.
After that, leverage a global average pooling operation followed by a convolutional layer with kernel size 1 × 1 to predict the relative attention mask α.
After obtaining the two attention maps, perform a pixel-wise product between each mask and its corresponding features, followed by a pixel-wise summation, to generate the final result.
This module employs the relative attention mask of adjacent features to guide the response of both features, which bridges the semantic and resolution gap between multi-level features better than a simple combination.
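A minimal PyTorch sketch consistent with the steps above (the channel projection of the upsampled feature, the sigmoid on the mask, and using 1 − α as the second mask are assumptions made to keep the sketch self-contained):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Sketch of attention-guided fusion of two adjacent feature maps."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        # 3x3 conv (with BN and ReLU) applied to the low-level feature F_{l-1}
        self.local = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))
        # Assumed: project the upsampled high-level feature F_l to the same width
        self.proj = nn.Conv2d(high_ch, out_ch, 1)
        # 1x1 conv on the concatenation, then GAP + 1x1 conv predicts the mask alpha
        self.fuse = nn.Conv2d(out_ch * 2, out_ch, 1)
        self.mask = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, f_high, f_low):
        f_low = self.local(f_low)
        f_high = F.interpolate(f_high, size=f_low.shape[2:],
                               mode='bilinear', align_corners=False)
        f_high = self.proj(f_high)
        fused = self.fuse(torch.cat([f_high, f_low], dim=1))
        # Relative attention mask from global average pooling + 1x1 conv (assumed sigmoid)
        alpha = torch.sigmoid(self.mask(F.adaptive_avg_pool2d(fused, 1)))
        # Pixel-wise product with each level, then pixel-wise summation
        return alpha * f_high + (1.0 - alpha) * f_low
```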
References:
Strip Pooling: Rethinking spatial pooling for scene parsing_CVPR_2020