ClusDet [32] unifies object clustering and detection in an end-to-end framework by sequentially finding clustered areas and detecting objects in these areas.
Zhang et al. [33] proposed a difficult region estimation network to find difficult high-density areas for further detection. Aiming to address vehicle detection challenges caused by the diversity of drone-captured images, AdNet [34] aligns features across different viewpoints, illumination, weather, and backgrounds, following the idea of domain adaptation.
As discussed in ClusDet [32], tiny objects and unevenly distributed objects severely hinder the performance of detection models.
GLSAN [35] adds an efficient self-adaptive region-selecting algorithm to the global–local detection network, finding high-density areas and accurately detecting objects with large size variation.
DMNet [36] proposes a novel crop strategy guided by a density map, removing areas without objects and balancing foreground and background information.
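For illustration, a minimal sketch of density-guided cropping in Python (not DMNet's exact pipeline: the fixed threshold, the connected-component grouping, and the `min_area` filter are simplifying assumptions):

```python
import numpy as np
from scipy import ndimage

def density_guided_crops(density_map, thresh=0.05, min_area=64):
    """Turn a predicted density map into crop regions: pixels above
    `thresh` are treated as foreground, connected blobs become crops,
    and object-free areas are discarded before fine detection."""
    mask = density_map > thresh
    labels, _ = ndimage.label(mask)            # group foreground pixels into blobs
    crops = []
    for sl in ndimage.find_objects(labels):    # bounding slices, one per blob
        h, w = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
        if h * w >= min_area:                  # drop tiny spurious blobs
            crops.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))
    return crops

# Toy usage: a synthetic density map with two dense regions.
dm = np.zeros((256, 256))
dm[20:60, 30:90] = 0.3
dm[150:220, 140:200] = 0.4
print(density_guided_crops(dm))   # two crop boxes, background ignored
```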
Yu et al. [37] analyzed the detection results of DMNet [36] and found that it suffers a clear performance degradation in long-tail scenes. They therefore designed DSHNet [37] to handle head and tail classes separately by combining class-biased samplers and bilateral box heads.
MDCT [38] designs a multi-kernel dilated convolution (MDC) block and a transformer block to identify small objects in dense scenes.
Gallo et al. [39] utilized the YOLOv7 model to address the challenges posed by unstructured crop conditions and the high biological variation of weeds.
RAANet [40] constructs a new residual atrous spatial pyramid pooling (ASPP) module by embedding an attention module and a residual structure into ASPP, dealing with the variability and complex backgrounds of land use in high-resolution imagery.
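A rough sketch of this general idea follows (the dilation rates and the squeeze-and-excitation-style gate are our assumptions, not RAANet's exact design):

```python
import torch
import torch.nn as nn

class ResidualASPP(nn.Module):
    """Parallel atrous convolutions at several rates, fused by a 1x1
    conv, gated by channel attention, and added back residually."""
    def __init__(self, ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)
        self.att = nn.Sequential(                 # channel-attention gate
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid()
        )

    def forward(self, x):
        y = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return x + y * self.att(y)                # residual shortcut

x = torch.randn(1, 64, 32, 32)
print(ResidualASPP(64)(x).shape)                  # torch.Size([1, 64, 32, 32])
```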
HawkNet [41] proposes an up-scale feature aggregation framework to fully utilize multi-scale complementary information.
CDMNet [42] formulates density maps in a coarse-grained form and designs a lightweight dual-task density estimation network.
FiFoNet [43] effectively selects a combination of multi-scale features for an object and blocks background interference, further enhancing the discriminability of the multi-scale feature representation.
TPH-YOLOv5 [20] combines a transformer-based prediction head with the YOLOv5 detection model, achieving significant performance improvements in scenes with large size variation and high object density.
UAV-Net [44] comprehensively analyzes the influence of different backbone architectures, prediction heads, and model pruning methods, and constructs a better combination to realize fast object detection.
GDFNet [45] uses a global density model to jointly extract density information from multi-level pyramid features, making it faster than most models based on pyramid feature fusion architectures.
RHFNet [46] utilizes a bidirectional fusion architecture to fully exploit multi-layer features, efficiently realizing small object detection.
HSD [47] proposes a novel reg-offset-cls module and a stacking strategy, achieving both precision and speed.
By integrating specialized feature extraction and information fusion techniques, SODNet [48] effectively improves small object detection while maintaining high real-time performance.
Dividing a high-resolution input image into a number of chips still incurs a high computational cost, so UFPMP-Det [49] merges the sub-regions given by a coarse detector into a mosaic for a single inference, further improving detection efficiency.
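A toy sketch of the mosaic idea (simple shelf packing on a fixed canvas; UFPMP-Det's actual packing is more sophisticated and these sizes are assumptions):

```python
import numpy as np

def pack_mosaic(crops, canvas_hw=(640, 640)):
    """Place sub-region crops side by side on one canvas so the
    fine-stage detector runs a single inference over all of them."""
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    placements = []                    # (y, x) offset of each crop
    x = y = shelf_h = 0
    for c in crops:
        h, w = c.shape[:2]
        if x + w > W:                  # start a new shelf
            x, y = 0, y + shelf_h
            shelf_h = 0
        if y + h > H:
            break                      # canvas full; real systems resize or repack
        canvas[y:y + h, x:x + w] = c
        placements.append((y, x))
        x += w
        shelf_h = max(shelf_h, h)
    return canvas, placements

crops = [np.full((100, 150, 3), i * 40, np.uint8) for i in range(1, 5)]
mosaic, offsets = pack_mosaic(crops)
print(offsets)                         # where each sub-region landed
```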
Swin Transformer [50] uses a shifted windowing scheme to realize efficient Transformer computation, limiting self-attention to non-overlapping local windows while still allowing cross-window connections.
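A minimal sketch of the partitioning mechanics (the attention computation itself and the masking Swin applies to rolled windows are omitted):

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping
    ws x ws windows; self-attention runs inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 8, 8, 32)          # toy feature map
local = window_partition(x, ws=4)     # regular windows
# Shifted windows: roll the map so the next block's windows straddle
# the previous block's window boundaries, creating cross-window links.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
cross = window_partition(shifted, ws=4)
print(local.shape, cross.shape)       # (4, 16, 32) each
```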
To further expand the cross-window connection, CSWin Transformer [51] proposes a cross-shaped window self-attention mechanism that computes self-attention in horizontal and vertical stripes, promoting connections from a global perspective.
CrossViT [52] proposes a novel Transformer-based module that operates between features with different spatial sizes. CrossViT first applies two ViTs to the two features separately; it then exchanges the class tokens of the two features and computes cross-attention between them.
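A minimal sketch of the token-exchange step, assuming both branches share one embedding dimension (CrossViT actually projects between branch dimensions):

```python
import torch
import torch.nn as nn

class CrossAttnToken(nn.Module):
    """One branch's CLS token attends over the other branch's
    patch tokens, the core of CrossViT-style token fusion."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # Query: CLS from branch A; key/value: patches from branch B.
        fused, _ = self.attn(cls_a, tokens_b, tokens_b)
        return fused

dim = 64
small = torch.randn(2, 1 + 16, dim)   # CLS + 16 patch tokens
large = torch.randn(2, 1 + 4, dim)    # CLS + 4 patch tokens
xattn = CrossAttnToken(dim)
# Exchange: each branch's CLS token queries the other branch's patches.
cls_small = xattn(small[:, :1], large[:, 1:])
cls_large = xattn(large[:, :1], small[:, 1:])
print(cls_small.shape, cls_large.shape)   # (2, 1, 64) each
```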
V2X-ViT [53] proposes a novel vision Transformer for V2X communication to achieve accurate 3D object detection.
CoBEVT [54] designs a fused axial attention (FAX) module to realize bird's-eye-view semantic segmentation.
MaxViT [55] consists of two components: blocked local attention and dilated global attention, which together allow global–local spatial interaction on arbitrary input resolutions with only linear complexity.
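A minimal sketch of the two partitions on toy shapes (MaxViT additionally interleaves these attention layers with MBConv blocks, which we omit):

```python
import torch

def block_partition(x, p):
    """Group pixels into p x p local blocks (block attention)."""
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Group pixels into a dilated g x g grid (grid attention):
    each group samples pixels strided across the whole map."""
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

x = torch.randn(1, 8, 8, 32)
print(block_partition(x, 4).shape)  # (4, 16, 32): local neighbours
print(grid_partition(x, 4).shape)   # (4, 16, 32): global, strided
```

The Transformer is also widely used in image object detection tasks.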
DETR [56] proposes an end-to-end architecture for object detection by casting the task as a direct set prediction problem. However, in DETR, each object query does not focus on a specific region.
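For intuition, a toy sketch of the bipartite matching behind set prediction (the cost matrix below is a placeholder, not DETR's actual class-probability plus box cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Set prediction requires a one-to-one assignment between a fixed set
# of predictions (object queries) and the ground-truth objects; DETR
# finds it by Hungarian matching on a pairwise matching cost.
cost = np.array([[0.2, 0.9, 0.8],    # 4 queries x 3 ground-truth objects
                 [0.7, 0.1, 0.6],
                 [0.5, 0.8, 0.3],
                 [0.9, 0.9, 0.9]])

rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
for q, t in zip(rows, cols):
    print(f"query {q} -> object {t}")
# The remaining unmatched query is supervised to predict 'no object'.
```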
Anchor DETR [57] proposes a query design and an attention variant that make each object query focus on the objects near its anchor point.
YOLOS [58] proposes a series of Transformer-based object detection models.