[TPH-YOLOv5] Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios
Beihang University
{Dense Objects in Drone Images, 4 TPH Heads, Transformer Encoder, CBAM, Model Ensemble, Self-trained Classifier}
Object detection in drone-captured scenarios has recently become a popular task.
Difficulty:
As drones navigate at different altitudes, object scale varies drastically.
High-speed, low-altitude flight brings motion blur to densely packed objects.
Figure 3: Main problems of object detection in drone images.
One more prediction head.
Transformer Prediction Heads (TPH).
Convolutional Block Attention Module (CBAM).
Bags of useful strategies.
Detection tasks using deep convolutional neural networks [40, 37, 34, 27, 58].
YOLO series [37, 38, 39, 2].
One-stage detectors: YOLOX [11], FCOS [48], DETR [65], Scaled-YOLOv4 [51], EfficientDet [45].
Two-stage detectors: VFNet [59], CenterNet2 [62].
Anchor-based detectors: Scaled-YOLOv4 [51], YOLOv5 [21].
Anchor-free detectors: CenterNet [63], YOLOX [11], RepPoints [55].
Some detectors are specially designed for Drone-captured images like RRNet [4], PENet [46], and CenterNet [63].
VGG [42], ResNet [17], DenseNet [20], MobileNet [19], EfficientNet [44], CSPDarknet53 [52], Swin Transformer [35].
FPN [28], PANet [33], NAS-FPN [12], BiFPN [45], ASFF [32], SFAM [61].
The head is responsible for detecting the location and category of objects from the feature maps extracted by the backbone.
YOLO series [37, 38, 39, 2], SSD [34] and RetinaNet [29].
Deep Learning provides greater flexibility and can scale in proportion to the amount of training data.
Disadvantage:
Sensitive to the details of the training data.
Neural networks find a different set of weights each time they are trained, resulting in different predictions.
This gives neural networks high variance.
Propose: train multiple models instead of a single model, and combine the predictions of these models.
Methods to ensemble boxes from different object detection models:
Non-maximum suppression (NMS) [36]
Soft-NMS [53]: applies an attenuation function to the confidence of adjacent bounding boxes based on their IoU value, instead of setting their scores to zero and deleting them (sketched below).
Weighted box fusion (WBF) [43]: merges all boxes, weighted by their confidence scores, to form the final result.
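A minimal NumPy sketch of the Gaussian Soft-NMS variant referenced above. The `sigma` and score-threshold values here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def box_area(b):
    """Area of boxes in (x1, y1, x2, y2) format."""
    return (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])

def iou(box, boxes):
    """IoU between a single box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (box_area(box) + box_area(boxes) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS: instead of deleting neighbours of the current
    top-scoring box, decay their confidence by exp(-IoU^2 / sigma)."""
    scores = scores.copy()
    keep, idxs = [], np.arange(len(scores))
    while len(idxs) > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(top)
        idxs = idxs[idxs != top]
        scores[idxs] *= np.exp(-(iou(boxes[top], boxes[idxs]) ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thr]  # drop boxes whose score decayed away
    return keep
```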
Based on YOLOv5, one more prediction head is added to detect objects at different scales (in particular, tiny objects).
The original prediction heads are replaced with Transformer Prediction Heads (TPH) to explore the prediction potential with a self-attention mechanism.
The Convolutional Block Attention Module (CBAM) is also integrated to find attention regions in scenarios with dense objects.
For further improvement, a bag of useful strategies is applied, such as data augmentation, multi-scale testing, and multi-model ensemble.
Extra self-trained classifiers (ResNet18) are used for confusing categories.
CSPDarknet53 + SPP + PANet + TPH heads.
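As one concrete illustration of the components named above, here is a minimal PyTorch sketch of a YOLOv5-style SPP block (parallel max-pools with kernels 5/9/13, concatenated with the input). The channel sizes and the omission of BatchNorm/SiLU are simplifications of my own, not the exact YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """YOLOv5-style Spatial Pyramid Pooling: parallel max-pools with
    kernels 5/9/13, concatenated with the (channel-reduced) input."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_hid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hid, kernel_size=1, bias=False)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        self.fuse = nn.Conv2d(c_hid * (len(kernels) + 1), c_out, kernel_size=1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```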
Photometric (Hue, Saturation, and Values) and Geometric Distortions (Scale, Crop, Translate, Shear, and Rotate).
MixUp [57] (sketched after this list).
CutMix [56].
Mosaic [2].
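Of the augmentations listed, MixUp is the simplest to sketch. Below is a minimal version for detection, assuming same-sized images; the Beta parameter `alpha=8.0` is an illustrative assumption, not necessarily the value used in the paper's pipeline.

```python
import numpy as np

def mixup(img1, boxes1, img2, boxes2, alpha=8.0):
    """MixUp for detection: blend two same-sized HxWx3 uint8 images with a
    Beta-sampled ratio and keep the union of their box labels.
    Boxes are (N, 5) arrays of [class, x1, y1, x2, y2]."""
    lam = np.random.beta(alpha, alpha)                  # mixing ratio
    mixed = (lam * img1 + (1.0 - lam) * img2).astype(np.uint8)
    boxes = np.concatenate([boxes1, boxes2], axis=0)    # labels are merged, not blended
    return mixed, boxes
```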
Replace some convolutional blocks and CSP bottleneck blocks in the original version of YOLOv5 with transformer encoder blocks.
The transformer encoder block can capture global information and abundant contextual information.
Transformer encoder blocks also increase the ability to capture different local information, and they can explore the feature representation potential with a self-attention mechanism [50].
On the VisDrone2021 dataset, transformer encoder blocks perform better on occluded objects in high-density scenes.
Based on YOLOv5, we apply transformer encoder blocks only in the head part, forming the Transformer Prediction Heads (TPH), and at the end of the backbone.
Because the feature maps at the end of the network have low resolution, applying TPH to them keeps the otherwise expensive computation and memory costs low.
When enlarging the resolution of input images, some TPH blocks at early layers can optionally be removed to keep training feasible.
Each transformer encoder contains two sub-layers.
A multi-head attention layer.
A fully-connected layer.
A residual connection is used around each sub-layer (see the sketch below).
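A minimal PyTorch sketch of an encoder block with these two sub-layers. The head count, pre-norm placement, and MLP expansion ratio are assumptions of mine, not confirmed details of TPH-YOLOv5.

```python
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """A multi-head attention sub-layer and a fully-connected (MLP)
    sub-layer, each wrapped in a residual connection, applied to a
    flattened feature map."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                       # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # -> (B, H*W, C) token sequence
        y = self.norm1(seq)
        seq = seq + self.attn(y, y, y, need_weights=False)[0]  # residual over attention
        seq = seq + self.mlp(self.norm2(seq))                  # residual over MLP
        return seq.transpose(1, 2).reshape(b, c, h, w)
```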
CBAM is a lightweight module that can be integrated into most notable CNN architectures, and it can be trained in an end-to-end manner.
Given a feature map, CBAM sequentially infers attention maps along two separate dimensions, channel and spatial, and then multiplies the attention maps with the input feature map for adaptive feature refinement.
Model performance improves considerably with CBAM, demonstrating the module's effectiveness.
On drone-captured images, the large covered regions always contain confusing geographical elements. Using CBAM to extract the attention area helps TPH-YOLOv5 resist confusing information and focus on useful target objects.
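A minimal PyTorch sketch of CBAM as just described: channel attention first, then spatial attention. The reduction ratio (16) and 7×7 spatial kernel are the defaults from the original CBAM paper, assumed here rather than taken from these notes.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: shared MLP over global avg- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        score = self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(score).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention: a 7x7 conv over channel-wise avg and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    """Channel attention first, then spatial attention, each multiplied
    into the feature map for adaptive refinement."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```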
We train five different models from different perspectives for the model ensemble.
During the inference phase, we first perform a multi-scale testing (ms-testing) strategy on each single model:
Scale the test image to 1.3×.
Then reduce it by factors of 1, 0.83, and 0.67, respectively.
Flip the images horizontally.
Feed the six differently scaled images to TPH-YOLOv5 and fuse the testing predictions with NMS.
We perform the same ms-testing operation on each model and fuse the final five predictions by WBF to get the final result (sketched below).
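A sketch of this two-stage fusion, using the open-source `ensemble_boxes` package for NMS and WBF. Here `detect(model, img, scale, flip)` is a hypothetical hook standing in for TPH-YOLOv5 inference; it is assumed to return (boxes, scores, labels) with boxes normalized to [0, 1] and already mapped back to the original image frame. The IoU thresholds are illustrative.

```python
from ensemble_boxes import nms, weighted_boxes_fusion

def ms_test(detect, img):
    """Single-model ms-testing: 3 scales x {original, horizontal flip} = 6
    views, fused with NMS. `detect(img, scale, flip)` is a hypothetical
    inference hook returning (boxes, scores, labels)."""
    preds = [detect(img, 1.3 * f, flip)
             for f in (1.0, 0.83, 0.67) for flip in (False, True)]
    boxes, scores, labels = zip(*preds)
    return nms(list(boxes), list(scores), list(labels), iou_thr=0.5)

def ensemble(models, img, detect):
    """Cross-model ensemble: ms-testing per model, then WBF over the five results."""
    per_model = [ms_test(lambda im, s, fl: detect(m, im, s, fl), img) for m in models]
    boxes, scores, labels = zip(*per_model)
    return weighted_boxes_fusion(list(boxes), list(scores), list(labels), iou_thr=0.55)
```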
TPH-YOLOv5 has excellent localization ability but poor classification ability; the precision on some hard categories, such as tricycle and awning-tricycle, is very low.
Propose an extra self-trained classifier.
Construct a training set by cropping the ground-truth bounding boxes and resizing each image patch to 64×64.
Select ResNet18 [17] as the classifier network (a setup sketch follows).
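A minimal sketch of this setup. `crop_patches` is a hypothetical helper of my own; the output dimension of 10 assumes the standard VisDrone category set, and whether the ResNet18 weights are pre-trained is not stated in these notes.

```python
import cv2
import torch.nn as nn
from torchvision import models

def crop_patches(image, boxes, size=64):
    """Hypothetical helper: crop ground-truth boxes from an HxWx3 image and
    resize each patch to size x size. Boxes are (N, 5) [class, x1, y1, x2, y2]."""
    patches, labels = [], []
    for cls, x1, y1, x2, y2 in boxes.astype(int):
        patches.append(cv2.resize(image[y1:y2, x1:x2], (size, size)))
        labels.append(int(cls))
    return patches, labels

# ResNet18 with its final layer replaced for the 10 VisDrone categories.
classifier = models.resnet18()
classifier.fc = nn.Linear(classifier.fc.in_features, 10)
```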
In the training phase, we reuse part of the pre-trained YOLOv5x model, because TPH-YOLOv5 and YOLOv5 share most of the backbone (blocks 0–8) and part of the head (blocks 10–13 and blocks 15–18).
Training runs for 65 epochs, with the first 2 epochs used for warm-up.
The Adam optimizer is used with an initial learning rate of 3e-4 and a cosine learning-rate schedule; the learning rate of the last epoch decays to 0.12× the initial value.
The input image size is 1536 pixels, which limits the batch size to only 2.
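A sketch of this schedule in PyTorch. The linear warm-up shape is an assumption (the notes only say 2 warm-up epochs); the cosine decay is set up so the last epoch's multiplier is exactly 0.12.

```python
import math
import torch

def lr_multiplier(epoch, total=65, warmup=2, final_ratio=0.12):
    """(Assumed linear) warm-up for the first 2 epochs, then cosine decay
    from 1.0 down to 0.12x the initial rate at the last epoch."""
    if epoch < warmup:
        return (epoch + 1) / warmup
    t = (epoch - warmup) / max(total - warmup - 1, 1)
    return final_ratio + (1.0 - final_ratio) * 0.5 * (1.0 + math.cos(math.pi * t))

model = torch.nn.Linear(8, 8)                 # stand-in for TPH-YOLOv5
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_multiplier)
for epoch in range(65):
    # ... one training epoch at image size 1536, batch size 2 ...
    opt.step()                                # placeholder for the real update
    sched.step()
```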
Different input image sizes and per-category weights make each model unique, so that the final ensemble gets a relatively balanced result.
TPH-YOLOv5-1 uses an input image size of 1920, with equal weights for all categories.
TPH-YOLOv5-2 uses an input image size of 1536, with equal weights for all categories.
TPH-YOLOv5-3 uses an input image size of 1920, with each category weighted according to its number of labels, as shown in Fig. 8: the more labels a category has, the lower its weight.
TPH-YOLOv5-4 uses an input image size of 1536, with category weights likewise related to the number of labels.
TPH-YOLOv5-5 uses the YOLOv5l backbone and an input image size of 1536.
Evaluation: VisDrone2021 dataset, measured by mAP.