FPN makes use of the in-network feature hierarchy, which produces feature maps at different resolutions, to build a feature pyramid. To integrate multi-scale context information, FPN fuses features of different scales by upsampling and summation along a top-down path. However, features at different scales carry information at different levels of abstraction, and large semantic gaps exist between them. Although the fusion scheme adopted by FPN is simple and effective, fusing multiple features with large semantic gaps leads to a sub-optimal feature pyramid.
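For reference, the following is a minimal sketch of FPN's top-down fusion, assuming PyTorch and ResNet-style backbone channel widths; the class name `SimpleFPN` and the parameter names are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down fusion by upsampling and summation (FPN-style sketch)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), fpn_channels=256):
        super().__init__()
        # 1x1 lateral convs project each Ci to a common channel width
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, fpn_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convs smooth each fused map into P2..P5
        self.smooth = nn.ModuleList(
            nn.Conv2d(fpn_channels, fpn_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats = [C2, C3, C4, C5], from finest to coarsest resolution
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        # Top-down path: upsample the coarser map and sum with the lateral
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [conv(l) for conv, l in zip(self.smooth, laterals)]  # P2..P5
```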
This observation inspires us to propose Consistent Supervision, which enforces the same supervision signals on the multi-scale features before fusion, with the goal of narrowing the semantic gaps between them. Specifically, we first build a feature pyramid based on the multi-scale features {C2, C3, C4, C5} from the backbone. Then a Region Proposal Network (RPN) is appended to the resulting feature pyramid {P2, P3, P4, P5} to generate numerous RoIs. To conduct Consistent Supervision, each RoI is mapped to all feature levels, and the RoI features at each level of {M2, M3, M4, M5} are extracted by RoI-Align [12]. After that, multiple classification and box regression heads are attached to these features to generate an auxiliary loss. The parameters of these classification and regression heads are shared across levels, which further forces the different feature maps to learn similar semantic information, beyond merely receiving the same supervision signals. For more stable optimization, a weight is used to balance the auxiliary loss generated by Consistent Supervision against the original loss. Formally, the final loss function of the R-CNN head is formulated as follows:
$$L_{rcnn} = \lambda\big(L_{cls,M}(p_M, t^*) + \beta[t^* > 0]\,L_{loc,M}(d_M, b^*)\big) + L_{cls,P}(p, t^*) + \beta[t^* > 0]\,L_{loc,P}(d, b^*). \tag{1}$$
$L_{cls,M}$ and $L_{loc,M}$ are the objective functions of the auxiliary loss attached to {M2, M3, M4, M5}, while $L_{cls,P}$ and $L_{loc,P}$ are the original loss functions on the feature pyramid {P2, P3, P4, P5}. $p_M$, $d_M$ and $p$, $d$ are the predictions of the intermediate layers and the final pyramid layers, respectively. $t^*$ and $b^*$ are the ground-truth class label and regression target, respectively. $\lambda$ is the weight balancing the auxiliary loss against the original loss, and $\beta$ is the weight balancing the classification and localization losses. The indicator $[t^* > 0]$ is defined as follows:
$$[t^* > 0] = \begin{cases} 1, & t^* > 0 \\ 0, & t^* = 0 \end{cases} \tag{2}$$
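To make Eq. (1)-(2) concrete, below is a minimal PyTorch sketch of the Consistent Supervision loss. The helpers `shared_head` and `pyramid_head`, the per-level RoI feature lists, and the default weight values are illustrative assumptions, not the paper's implementation; standard cross-entropy and smooth-L1 stand in for the classification and localization terms.

```python
import torch
import torch.nn.functional as F

def consistent_supervision_loss(level_feats, pyramid_feats,
                                shared_head, pyramid_head,
                                t_star, b_star, lam=0.25, beta=1.0):
    """level_feats: list of RoI features from {M2, M3, M4, M5};
    pyramid_feats: RoI features from the fused pyramid {P2..P5};
    t_star: ground-truth class labels (0 = background);
    b_star: ground-truth box regression targets."""
    fg = (t_star > 0).float()        # the indicator [t* > 0] of Eq. (2)

    # Auxiliary loss: the SAME head (shared parameters) is applied to
    # every level, enforcing identical supervision before fusion.
    aux = 0.0
    for feat in level_feats:
        p_m, d_m = shared_head(feat)  # per-level class scores / box deltas
        aux = aux + F.cross_entropy(p_m, t_star)
        aux = aux + beta * (fg * F.smooth_l1_loss(
            d_m, b_star, reduction="none").sum(dim=1)).mean()

    # Original loss on the RoI features from the fused feature pyramid
    p, d = pyramid_head(pyramid_feats)
    orig = F.cross_entropy(p, t_star)
    orig = orig + beta * (fg * F.smooth_l1_loss(
        d, b_star, reduction="none").sum(dim=1)).mean()

    return lam * aux + orig           # Eq. (1)
```

Multiplying the localization term by `fg` zeroes it out for background RoIs, exactly as the indicator $[t^* > 0]$ does in Eq. (1).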
In the testing phase, the auxiliary branches are discarded and only the branch after the feature pyramid is used for the final prediction. Consistent Supervision therefore introduces no extra parameters or computation at inference time.