FPN makes use of the in-network feature hierarchy, which produces feature maps at different resolutions, to build a feature pyramid. To integrate multi-scale context information, FPN fuses features of different scales by upsampling and summation along a top-down path. However, features at different scales carry information at different levels of abstraction, and large semantic gaps exist between them. Although the fusion scheme adopted by FPN is simple and effective, fusing multiple features with large semantic gaps leads to a sub-optimal feature pyramid.
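For reference, the following is a minimal sketch of FPN's top-down fusion, assuming PyTorch and ResNet-style backbone channel widths; the class name `SimpleFPN` and the parameter names are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down fusion by upsampling and summation (FPN-style sketch)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), fpn_channels=256):
        super().__init__()
        # 1x1 lateral convs project each Ci to a common channel width
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, fpn_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convs smooth each fused map into P2..P5
        self.smooth = nn.ModuleList(
            nn.Conv2d(fpn_channels, fpn_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats = [C2, C3, C4, C5], from finest to coarsest resolution
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        # Top-down path: upsample the coarser map and sum with the lateral
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [conv(l) for conv, l in zip(self.smooth, laterals)]  # P2..P5
```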
This observation inspires us to propose Consistent Supervision, which enforces the same supervision signals on the multi-scale features before fusion, with the goal of narrowing the semantic gaps between them. Specifically, we first build a feature pyramid based on the multi-scale features {C2, C3, C4, C5} from the backbone. Then a Region Proposal Network (RPN) is appended to the resulting feature pyramid {P2, P3, P4, P5} to generate numerous RoIs. To conduct Consistent Supervision, each RoI is mapped to all feature levels, and the RoI features at each level of {M2, M3, M4, M5} are extracted by RoI-Align [12]. After that, multiple classification and box regression heads are attached to these features to generate an auxiliary loss. The parameters of these classification and regression heads are shared across levels, which further forces the different feature maps to learn similar semantic information, beyond merely receiving the same supervision signals. For more stable optimization, a weight is used to balance the auxiliary loss generated by Consistent Supervision against the original loss. Formally, the final loss function of the R-CNN head is formulated as follows:
$$L_{rcnn} = \lambda\big(L_{cls,M}(p_M, t^*) + \beta[t^* > 0]\,L_{loc,M}(d_M, b^*)\big) + L_{cls,P}(p, t^*) + \beta[t^* > 0]\,L_{loc,P}(d, b^*). \tag{1}$$
$L_{cls,M}$ and $L_{loc,M}$ are the objective functions of the auxiliary loss attached to {M2, M3, M4, M5}, while $L_{cls,P}$ and $L_{loc,P}$ are the original loss functions on the feature pyramid {P2, P3, P4, P5}. $p_M$, $d_M$ and $p$, $d$ are the predictions of the intermediate layers and the final pyramid layers, respectively. $t^*$ and $b^*$ are the ground-truth class label and regression target, respectively. $\lambda$ is the weight balancing the auxiliary loss against the original loss, and $\beta$ is the weight balancing the classification and localization losses. The indicator $[t^* > 0]$ is defined as follows:
$$[t^* > 0] = \begin{cases} 1, & t^* > 0 \\ 0, & t^* = 0 \end{cases} \tag{2}$$
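To make Eq. (1)-(2) concrete, below is a minimal PyTorch sketch of the Consistent Supervision loss. The helpers `shared_head` and `pyramid_head`, the per-level RoI feature lists, and the default weight values are illustrative assumptions, not the paper's implementation; standard cross-entropy and smooth-L1 stand in for the classification and localization terms.

```python
import torch
import torch.nn.functional as F

def consistent_supervision_loss(level_feats, pyramid_feats,
                                shared_head, pyramid_head,
                                t_star, b_star, lam=0.25, beta=1.0):
    """level_feats: list of RoI features from {M2, M3, M4, M5};
    pyramid_feats: RoI features from the fused pyramid {P2..P5};
    t_star: ground-truth class labels (0 = background);
    b_star: ground-truth box regression targets."""
    fg = (t_star > 0).float()        # the indicator [t* > 0] of Eq. (2)

    # Auxiliary loss: the SAME head (shared parameters) is applied to
    # every level, enforcing identical supervision before fusion.
    aux = 0.0
    for feat in level_feats:
        p_m, d_m = shared_head(feat)  # per-level class scores / box deltas
        aux = aux + F.cross_entropy(p_m, t_star)
        aux = aux + beta * (fg * F.smooth_l1_loss(
            d_m, b_star, reduction="none").sum(dim=1)).mean()

    # Original loss on the RoI features from the fused feature pyramid
    p, d = pyramid_head(pyramid_feats)
    orig = F.cross_entropy(p, t_star)
    orig = orig + beta * (fg * F.smooth_l1_loss(
        d, b_star, reduction="none").sum(dim=1)).mean()

    return lam * aux + orig           # Eq. (1)
```

Multiplying the localization term by `fg` zeroes it out for background RoIs, exactly as the indicator $[t^* > 0]$ does in Eq. (1).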
In the testing phase, the auxiliary branches are discarded and only the branch after the feature pyramid is used for the final prediction. Consistent Supervision therefore introduces no extra parameters or computation at inference time.