Learning Visual Context by Comparison
Keywords: Context Modeling, Attention Mechanisms, Chest X-Ray
0. Motivation, Objective, Contributions and Related Works:
Motivation:
Finding diseases from an X-ray image is an important yet highly challenging task.
Existing approaches miss one of the most important characteristics of the reading process: the necessity of comparison between related regions in an image.
Objectives:
Present the Attend-and-Compare Module (ACM) for capturing the difference between an object of interest and its corresponding context.
+ Explicit difference modeling can be very helpful in tasks that require direct comparison between distant locations.
+ This module can be plugged into existing deep learning models.
ACM is validated over three chest X-ray datasets [37] and COCO dataset [24] with various backbones such as ResNet [14], ResNeXt [40] or DenseNet [16].
The explicit comparison process by ACM indeed improves the recognition performance.
Contributions:
Propose a novel context module called ACM that explicitly compares different regions, following the way radiologists read chest X-rays.
The proposed ACM captures multiple comparative self-attentions whose difference is beneficial to recognition tasks.
We demonstrate the effectiveness of ACM on three chest X-ray datasets [37] and COCO detection & segmentation dataset [24] with various architectures.
Related Works:
Context Modeling: commonly conducted with self-attention mechanisms [33,15,22,30] that make use of global information:
Self-attention mechanisms [15,34,7,22,38] generate dynamic attention maps for recalibration (e.g., emphasizing salient regions or channels):
1) Squeeze-and-Excitation network (SE) [15] learns to model channel-wise attention using the spatially averaged feature (a minimal sketch follows after this list).
2) A Style-based Recalibration Module (SRM) [22] further explores the global feature modeling in terms of style recalibration.
3) Convolutional block attention module (CBAM) [38] extends the SE module to the spatial dimension by sequentially attending to important locations and channels given the feature.
Adv: The attention values are computed with global or larger receptive fields, and thus, more contextual information can be embedded in the features.
Dis-adv: As the information is aggregated into a single feature by average or similar operations, spatial information from the relationship among multiple locations may be lost.
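For reference, a minimal PyTorch sketch of an SE-style channel recalibration block (an illustrative sketch, not the authors' code; the reduction ratio of 16 is an assumption):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation style channel recalibration (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                             # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                        # squeeze: global average pooling -> (B, C)
        w = self.fc(s).view(x.size(0), -1, 1, 1)      # excitation: per-channel gates in (0, 1)
        return x * w                                  # recalibrate channels
```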
Another stem of work uses pixel-level pairwise relationships to explicitly model context [36,17,4] (focusing on long-range dependencies and explicitly modeling context aggregation from dynamically chosen locations):
1) Non-local neural networks (NL) [36] calculate pixel-level pairwise relationship weights and aggregate (weighted average) the features from all locations according to the weights (a simplified sketch follows after this list).
2) Global-Context network (GC) [4] challenges the necessity of using all pairwise relationships in NL and suggests softly aggregating a single distinctive context feature for all locations.
3) Criss-cross attention (CC) [17] for semantic segmentation reduces the computation cost of NL by replacing the pairwise relationship attention maps with criss-cross attention block which considers only horizontal and vertical directions separately.
Adv: NL and CC explicitly model the pairwise relationship between regions with affinity metrics.
Dis-adv: The qualitative results in [36,17] demonstrate a tendency to aggregate features only among foreground objects or among pixels with similar semantics.
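For reference, a simplified sketch of an embedded-Gaussian non-local block in the spirit of NL [36] (illustrative only; the bottleneck width is an assumption):

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Pairwise (pixel-to-pixel) attention aggregation, simplified."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2                          # bottleneck width (assumption)
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise relationship weights (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual aggregation
```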
Contrastive attention [31,42]:
+ MGCAM [31] uses the contrastive feature between persons and backgrounds, but it requires extra mask supervision for persons.
+ C-MWP [42] is a technique for generating more accurate localization maps in a contrastive manner, but it is not a learning-based method and uses pre-trained models.
==> Inspired by how radiologists diagnose, ACM (proposed module) explicitly models a comparing mechanism:
Stems from the precise need for incorporating difference operation in reading chest radiographs.
Rather than computing an affinity-map-based attention as in NL [36], ACM explicitly uses a direct comparison procedure for context modeling;
Rather than using extra supervision to localize regions to compare as in MGCAM [31], ACM automatically learns to focus on meaningful regions to compare.
The efficacy of our explicit and data-driven contrastive modeling is shown by the superior performance over other context modeling works.
Chest X-ray as a Context-Dependent Task:
Commonly occurring diseases can be classified and located in a weakly-supervised multi-label classification framework.
+ ResNet [14] and DenseNet [16,29] pretrained on ImageNet [8] have set a strong baseline for these tasks, and other studies have been conducted on top of them to cover various issues of recognition task in the chest X-ray modality.
To address the issue of localizing diseases using only class-level labels:
+ Guendel et al. [12] propose an auxiliary localization task where the ground truth of the location of the diseases is extracted from the text report.
+ [35,32,10] use the attention module to indirectly align the class-level prediction with the potentially abnormal location without the text reports on the location of the disease.
+ It is helpful to leverage both a small number of location annotations and a large number of class-level labels to improve both localization and classification performances [23].
+ Guan et al. [11] also propose a hierarchical hard-attention for cascaded inference.
==> The difference between an object of interest and a corresponding context could be the crucial key for classifying or localizing several diseases as it is important to compare semantically meaningful locations.
==> Our work is the first to utilize this characteristic (capturing the semantic difference between regions) in the chest X-ray image recognition setting.
Chest X-ray: One of the most common and readily available examinations for diagnosing chest diseases.
Purpose: In the US, more than 35 million chest X-rays are taken every year (used to screen diseases such as lung cancer, pneumonia, tuberculosis and pneumothorax)
Problem: the heavy workload of reading chest X-rays. Radiologists usually read tens or hundreds of X-rays every day. Several studies regarding radiologic errors [28,9] have reported that 20-30% of exams are misdiagnosed.
Method: Many hospitals equip radiologists with computer-aided diagnosis systems. Recent developments in medical image recognition models have shown potential to improve diagnostic accuracy [26].
Find thoracic diseases from chest X-rays using deep learning [41,12,23,29].
Some classify thoracic diseases, and others further localize the lesions.
Yao et al. [41] handle varying lesion sizes.
Mao et al. [25] take the relation between X-rays of the same patient into consideration.
Wang et al. [35] introduce an attention mechanism to focus on regions of diseases.
The way radiologists read X-rays?
When radiologists read chest X-rays, they compare zones [1], paying close attention to any asymmetry between left and right lungs, or any changes between semantically related regions, that are likely to be due to diseases.
This comparison process provides contextual clues for the presence of a disease that local texture information may fail to highlight.
Fig. 1 illustrates an example of the process. Previous studies [36,15,4,38] proposed various context models, but none addressed the need for the explicit procedure to compare regions in an image.
Attend-and-Compare Module (ACM):
Mimicking the way radiologists read X-rays: Extracts features of an object of interest and a corresponding context to explicitly compare them by subtraction.
It imposes no explicit constraints for symmetry and learns to compare regions in a data-driven way.
Fig. 1: An example of a comparison procedure by radiologists. Small differences indicate no disease (blue); a significant difference is likely to be a lesion (red).
1) Attend-and-Compare Module (ACM):
Fig. 2: Illustration of the ACM module. It takes in an input feature and uses the mean-subtracted feature to calculate two feature vectors (K, Q). Each feature vector (K or Q) contains multiple attention vectors from multiple locations calculated using grouped convolutions and normalizations. The difference of the vectors is added to the main feature to make the information more distinguishable. The resulting feature is modulated channel-wise by the global information feature.
1.1 Overview:
Attend-and-Compare Module (ACM) extracts an object of interest and the corresponding context to compare, and enhances the original image feature with the comparison result. ACM is also designed to be light-weight, self-contained, and compatible with popular backbone architectures [14,16,40]. We formulate ACM, which comprises three procedures, as:
$$Y = f_{\mathrm{ACM}}(X) = P \odot (X + (K - Q)), \quad (1)$$
where $f_{\mathrm{ACM}}$ is a transformation mapping an input feature $X \in \mathbb{R}^{C \times H \times W}$ to an output feature $Y \in \mathbb{R}^{C \times H \times W}$ of the same dimension, and $\odot$ denotes channel-wise multiplication. Between $K \in \mathbb{R}^{C \times 1 \times 1}$ and $Q \in \mathbb{R}^{C \times 1 \times 1}$, one is intended to be the object of interest and the other the corresponding context. ACM compares the two by subtracting one from the other, adds the comparison result to the original feature X, and applies an additional channel re-calibration operation with $P \in \mathbb{R}^{C \times 1 \times 1}$. Fig. 2 illustrates Equation (1). These three features K, Q and P are conditioned on the input feature X and are explained in detail below.
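A minimal sketch of Equation (1) in PyTorch, assuming K, Q and P have already been computed (their construction is described in the component subsections below; the function name is illustrative):

```python
import torch

def acm_combine(x: torch.Tensor, k: torch.Tensor, q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Eq. (1): Y = P * (X + (K - Q)).

    x: (B, C, H, W) input feature map.
    k, q: (B, C, 1, 1) object-of-interest / context features.
    p: (B, C, 1, 1) channel re-calibration feature in (0, 1).
    The (B, C, 1, 1) vectors broadcast over the spatial dimensions.
    """
    return p * (x + (k - q))
```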
1.2 Components:
Object of Interest and Corresponding Context
To fully express the relationship between different spatial regions of an image, ACM generates two features (K, Q) that focus on two spatial regions of the input feature map X. First, ACM normalizes the input feature map as $X := X - \mu$, where $\mu$ is the C-dimensional mean vector of X. We include this procedure to make training more stable, as K and Q will be generated by learnable parameters $(W_K, W_Q)$ that are shared across all input features. Once X is normalized, ACM calculates K with $W_K$ as:
$$K = \sum_{i,j \in H,W} \frac{\exp(W_K X_{i,j})}{\sum_{h,w \in H,W} \exp(W_K X_{h,w})} X_{i,j}, \quad (2)$$
where $X_{i,j} \in \mathbb{R}^{C \times 1 \times 1}$ is a vector at spatial location $(i, j)$ and $W_K \in \mathbb{R}^{C \times 1 \times 1}$ is the weight of a 1 × 1 convolution. The above operation can be viewed as applying a 1 × 1 convolution on the feature map X to obtain a single-channel attention map in $\mathbb{R}^{1 \times H \times W}$, applying softmax to normalize the attention map, and finally taking a weighted average of the feature map X using the normalized map. Q is modeled likewise, but with $W_Q$. K and Q serve as features representing important regions in X. We add K − Q to the original feature so that the comparative information becomes more distinguishable in the feature.
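A sketch of Equation (2) in PyTorch for the ungrouped case (the grouped variant follows under Group Operation below; the class name is illustrative):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Softmax-attention-weighted average of a feature map (Eq. (2))."""
    def __init__(self, channels: int):
        super().__init__()
        # W_K: a 1x1 convolution producing a single-channel attention logit map
        self.w = nn.Conv2d(channels, 1, kernel_size=1, bias=False)

    def forward(self, x):                       # x: (B, C, H, W), already mean-subtracted
        b, c, h, w = x.shape
        logits = self.w(x).view(b, 1, h * w)    # (B, 1, HW)
        attn = torch.softmax(logits, dim=-1)    # normalize over all spatial locations
        feat = (x.view(b, c, h * w) * attn).sum(dim=-1)  # weighted average -> (B, C)
        return feat.view(b, c, 1, 1)
```

K and Q would each be an instance of such a module with its own 1 × 1 convolution weight.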
Channel Re-calibration In light of the recent success of self-attention modules that use a globally pooled feature to re-calibrate channels [15,22,38], we calculate the channel re-calibrating feature P as
$$P = \sigma \circ \mathrm{conv}^{1 \times 1}_{2} \circ \mathrm{ReLU} \circ \mathrm{conv}^{1 \times 1}_{1}(\mu), \quad (3)$$
where $\sigma$ and $\mathrm{conv}^{1 \times 1}$ denote a sigmoid function and a learnable 1 × 1 convolution, respectively. The resulting feature vector P is multiplied with X + (K − Q) to scale down certain channels. P can be viewed as marking which channels to attend to with respect to the task at hand.
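A sketch of Equation (3), assuming the two 1 × 1 convolutions form a bottleneck with reduction ratio r (the ratio value is an assumption; the paper's hyperparameter may differ):

```python
import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    """P = sigmoid(conv2(ReLU(conv1(mu)))) applied to the channel-mean vector mu (Eq. (3))."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.conv2 = nn.Conv2d(channels // r, channels, kernel_size=1)

    def forward(self, mu):                       # mu: (B, C, 1, 1), spatial mean of X
        p = torch.sigmoid(self.conv2(torch.relu(self.conv1(mu))))
        return p                                 # (B, C, 1, 1), per-channel gates in (0, 1)
```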
Group Operation To model relations between multiple regions from a single module, we incorporate group-wise operations. We replace all convolutions with grouped convolutions [21,40], in which the input and output are divided channel-wise into G groups and the convolution is performed for each group separately. In our work, we use grouped convolution to deliberately represent multiple important locations in the input. Here, we compute G different attention maps by applying a grouped convolution to X, and then obtain the representation $K = [K^1, \cdots, K^G]$ by aggregating each group of X with its attention map as follows:
$$K^g = \sum_{i,j \in H,W} \frac{\exp(W_K^g X_{i,j}^g)}{\sum_{h,w \in H,W} \exp(W_K^g X_{h,w}^g)} X_{i,j}^g, \quad (4)$$
where g refers to the g-th group.
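A sketch of Equation (4): a grouped 1 × 1 convolution yields G attention maps, one per channel group, and each channel group of X is pooled with its own map (the class name and the default G are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GroupedAttentionPool(nn.Module):
    """G attention maps, each pooling its own channel group (Eq. (4))."""
    def __init__(self, channels: int, groups: int = 32):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        # grouped 1x1 conv: one attention logit map per channel group
        self.w = nn.Conv2d(channels, groups, kernel_size=1, groups=groups, bias=False)

    def forward(self, x):                          # x: (B, C, H, W), mean-subtracted
        b, c, h, w = x.shape
        g = self.groups
        logits = self.w(x).view(b, g, 1, h * w)    # (B, G, 1, HW)
        attn = torch.softmax(logits, dim=-1)       # softmax over locations, per group
        xg = x.view(b, g, c // g, h * w)           # split channels into G groups
        feat = (xg * attn).sum(dim=-1)             # (B, G, C/G): K^g for each group
        return feat.view(b, c, 1, 1)               # concatenated K = [K^1, ..., K^G]
```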
Loss Function ACM learns to utilize comparative information within an image by modeling {K, Q} whose difference can be important for the given task. To further ensure diversity between them, we introduce an orthogonal loss based on a dot product, defined as
$$\ell_{\mathrm{orth}}(K, Q) = \frac{K \cdot Q}{C}, \quad (5)$$
where C refers to the number of channels. Minimizing this loss can be viewed as decreasing the similarity between K and Q. One trivial solution to minimizing the term would be making K or Q zeros, but they cannot be zeros as they come from the weighted averages of X. The final loss function for a target task can be written as
$$\ell_{\mathrm{task}} + \lambda \sum_{m}^{M} \ell_{\mathrm{orth}}(K_m, Q_m), \quad (6)$$
where $\ell_{\mathrm{task}}$ refers to a loss for the target task, and M refers to the number of ACMs inserted into the network. $\lambda$ is a constant for controlling the effect of the orthogonal constraint.
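A sketch of Equations (5) and (6), assuming the (K, Q) pairs from all inserted ACMs are collected during the forward pass (names and the λ value are illustrative):

```python
import torch

def orthogonal_loss(k: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Eq. (5): dot product between K and Q, normalized by the channel count C (batch-averaged)."""
    b, c = k.shape[0], k.shape[1]
    return (k.view(b, c) * q.view(b, c)).sum(dim=1).div(c).mean()

def total_loss(task_loss, kq_pairs, lam=0.1):
    """Eq. (6): task loss plus lambda-weighted orthogonal losses over all M ACMs.

    kq_pairs: list of (K, Q) tensors collected from each inserted ACM.
    lam: the orthogonality weight (the value 0.1 is an assumption).
    """
    return task_loss + lam * sum(orthogonal_loss(k, q) for k, q in kq_pairs)
```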
Placement of ACMs In order to model contextual information at various levels of feature representation, we insert multiple ACMs into the backbone network. In ResNet, following the placement rule of the SE module [15], we insert the module at the end of every Bottleneck block; for example, a total of 16 ACMs are inserted into ResNet-50. Since DenseNet contains far more DenseBlocks than ResNet has Bottleneck blocks, we insert an ACM only once every three DenseBlocks in DenseNet. Note that we did not optimize the placement location or the number of placements for each task. While we use multiple ACMs, the use of grouped convolution significantly reduces the computation cost of each module.
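A sketch of the ResNet placement rule described above, wrapping each torchvision Bottleneck so that an ACM runs on its output (the ACM class itself is assumed and not shown; the wrapper approach is an illustrative implementation choice):

```python
import torch.nn as nn

class BottleneckWithACM(nn.Module):
    """Wraps an existing Bottleneck so that an ACM runs on its output feature."""
    def __init__(self, bottleneck: nn.Module, acm: nn.Module):
        super().__init__()
        self.bottleneck = bottleneck
        self.acm = acm

    def forward(self, x):
        return self.acm(self.bottleneck(x))

def insert_acms(resnet: nn.Module, make_acm):
    """Append an ACM after every Bottleneck block of a torchvision ResNet."""
    for layer_name in ("layer1", "layer2", "layer3", "layer4"):
        layer = getattr(resnet, layer_name)
        for i, block in enumerate(layer):
            out_channels = block.conv3.out_channels      # Bottleneck output width
            layer[i] = BottleneckWithACM(block, make_acm(out_channels))
    return resnet

# Usage sketch (assumes an ACM class assembling the components above):
#   import torchvision
#   model = insert_acms(torchvision.models.resnet50(), make_acm=lambda c: ACM(c))
#   # ResNet-50 has 3 + 4 + 6 + 3 = 16 Bottleneck blocks -> 16 ACMs inserted.
```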
Fig. 3: Examples of pneumothorax cases and annotation maps in Em-Ptx dataset. Lesions are drawn in red. (a) shows a case with pneumothorax, and (b) shows a case which is already treated with a medical tube marked as blue
Fig. 4: Left: The visualized attention maps for the localization task on Em-Ptx dataset. The 11th group in the 16th module is chosen. Em-Ptx annotations are shown as red contours on the chest X-ray image. Right: The visualization on COCO dataset. Groundtruth segmentation annotations for each category are shown as red contours.
2) Experiments :
1) Attend-and-Compare Module:
Proposed a novel self-contained module, named Attend-and-Compare Module (ACM)
Key idea: extract an object of interest and a corresponding context and explicitly compare them to make the image representation more distinguishable.
ACM indeed improves the performance of visual recognition tasks in chest X-ray and natural image domains:
A simple addition of ACM provides consistent improvements over baselines on COCO as well as the public Chest X-ray14 dataset and the internally collected Em-Ptx and Ndl datasets.
ACM automatically learns dynamic relationships:
The objects of interest and corresponding contexts are different yet contain useful information for the given task.