YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications
Meituan Inc.
For years, the YOLO series has been the de facto industry-level standard for efficient object detection.
The YOLO community has prospered overwhelmingly to enrich its use in a multitude of hardware platforms and abundant scenarios.
Assimilate ideas from recent network design, training strategies, testing techniques, quantization and optimization methods.
Refashion a line of networks of different sizes tailored for industrial applications in diverse scenarios.
Imbue YOLOv6 with a self-distillation strategy, performed both on the classification task and the regression task.
Verify advanced detection techniques for label assignment, loss functions and data augmentation.
Reform the quantization scheme for detection with the help of RepOptimizer [2] and channel-wise distillation [36].
Several important factors:
Reparameterization from RepVGG [3].
Quantization of reparameterization-based detectors.
Latencies in deployment.
Label assignment and loss function.
Training strategy, such as knowledge distillation.
Figure. Fusion process of Rep operator.
The RepVGG Style structure is a reparameterizable structure that has a multi-branch topology during training and can be equivalently fused into a single 3x3 convolution during actual deployment.
The fused 3x3 convolution structure makes effective use of computationally intensive hardware (such as GPUs) and also benefits from the highly optimized NVIDIA cuDNN and Intel MKL libraries on GPU/CPU.
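To make the fusion concrete, here is a minimal PyTorch sketch of folding a RepVGG-style block (3x3 branch, 1x1 branch and identity branch, each followed by BatchNorm, stride 1, equal input/output channels, no convolution bias) into one 3x3 convolution. Function names such as fuse_conv_bn and fuse_repvgg_block are illustrative, not the official implementation.

import torch
import torch.nn as nn

def fuse_conv_bn(conv_weight, bn):
    # Fold BatchNorm statistics into the preceding convolution's weight and bias.
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    weight = conv_weight * scale.reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    return weight, bias

def fuse_repvgg_block(conv3x3, bn3x3, conv1x1, bn1x1, bn_id, channels):
    # 3x3 branch
    w3, b3 = fuse_conv_bn(conv3x3.weight, bn3x3)
    # 1x1 branch: zero-pad the kernel to 3x3 so it can be summed with the 3x3 branch
    w1, b1 = fuse_conv_bn(conv1x1.weight, bn1x1)
    w1 = nn.functional.pad(w1, [1, 1, 1, 1])
    # identity branch: build an identity 3x3 kernel, then fold its BatchNorm
    w_id = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    w_id, b_id = fuse_conv_bn(w_id, bn_id)
    # the deployed single 3x3 convolution
    fused = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
    fused.weight.data = w3 + w1 + w_id
    fused.bias.data = b3 + b1 + b_id
    return fused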
The renovated design of YOLOv6 consists of the following components:
Network design.
Label assignment.
Loss function.
Data augmentation.
Industry-handy improvements.
Quantization and deployment.
Based on EfficientNet-L2
Data augmentation: Mosaic [1,10] and Mixup [49], following [1,7,10].
Problem:
Multi-branch networks achieve better classification performance => reduced parallelism => increased inference latency.
Single-path networks (VGG) take advantage of high parallelism and a smaller memory footprint => higher inference efficiency.
RepVGG => hard to scale to larger models.
Propose EfficientRep backbone:
For small networks, take RepBlock as the building blocks during the training phase. Each RepBlock is converted to stacks of 3 × 3 convolutional layers (denoted as RepConv) with ReLU activation functions during the inference phase.
For medium and large models, we use the CSPStackRep Block. Each is composed of three 1×1 convolution layers and a stack of sub-blocks, each consisting of two RepVGG blocks [3] (or RepConvs at inference) with a residual connection. Besides, a cross stage partial (CSP) connection is adopted to boost performance without excessive computation cost. A minimal sketch of both block types follows below.
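The sketch below illustrates the training-time topology of the two block types in simplified PyTorch; it omits stride handling, the exact sub-block layout and the switch-to-deploy conversion, so treat the class layout as an assumption rather than the released code.

import torch
import torch.nn as nn

class RepVGGBlock(nn.Module):
    # Simplified multi-branch unit (3x3 + 1x1 + identity); stride/deploy handling omitted.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(out_ch))
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                   nn.BatchNorm2d(out_ch))
        self.identity = nn.BatchNorm2d(out_ch) if in_ch == out_ch else None
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv3(x) + self.conv1(x)
        if self.identity is not None:
            out = out + self.identity(x)
        return self.act(out)

class RepBlock(nn.Module):
    # Stack of RepVGG-style units; each fuses to a single 3x3 RepConv at inference time.
    def __init__(self, in_ch, out_ch, n=1):
        super().__init__()
        self.conv1 = RepVGGBlock(in_ch, out_ch)   # first unit aligns the channel dimension
        self.body = nn.Sequential(*(RepVGGBlock(out_ch, out_ch) for _ in range(n - 1)))

    def forward(self, x):
        return self.body(self.conv1(x))

class CSPStackRep(nn.Module):
    # Three 1x1 convolutions plus a stack of RepVGG sub-blocks, wired as a CSP structure
    # (the sub-block stack is simplified to a RepBlock here).
    def __init__(self, in_ch, out_ch, n=1, e=0.5):
        super().__init__()
        hid = int(out_ch * e)
        self.cv1 = nn.Conv2d(in_ch, hid, 1)
        self.cv2 = nn.Conv2d(in_ch, hid, 1)
        self.cv3 = nn.Conv2d(2 * hid, out_ch, 1)
        self.m = RepBlock(hid, hid, n=n)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))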
Figure. EfficientRep Backbone structure diagram
We replaced the ordinary Conv layer with stride=2 in Backbone with the RepConv layer with stride=2.
At the same time, the original CSP-Block is redesigned into RepBlock, where the first RepConv of RepBlock will transform and align the channel dimension.
In addition, we also redesigned the original SPPF into a more efficient SimSPPF.
Adopts PAN topology.
Enhance Neck with RepBlocks or CSPStackRep Blocks.
Propose Rep-PAN: Replace the CSPBlock used in YOLOv5 with RepBlock (for small models) or CSPStackRep Block (for large models), and adjust the width and depth accordingly.
Figure. Rep-PAN structure diagram
Rep-PAN is based on the PAN [6] topology and replaces the CSP-Block used in YOLOv5 with RepBlock, while the operators in the overall Neck are also adjusted.
The purpose is to achieve efficient inference on the hardware while maintaining good multi-scale feature fusion capability.
YOLOX: Decouples the classification and localization branches and introduces two additional 3x3 convolutional layers in each branch.
Propose Efficient Decoupled Head (a minimal sketch follows after this list):
Adopt a hybrid-channel strategy.
Reduce the number of the middle 3x3 convolutional layers to only one.
The width of the head is jointly scaled by the width multiplier for the backbone and the neck.
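A rough sketch of one level of such a decoupled head, assuming a 1x1 stem, a single 3x3 middle convolution per branch and 1x1 prediction layers; the channel numbers and activation choices are illustrative, not the released layout.

import torch.nn as nn

class EfficientDecoupledHead(nn.Module):
    # One detection head level: shared stem, then a single 3x3 conv per branch
    # (instead of two as in YOLOX), followed by 1x1 prediction layers.
    def __init__(self, in_ch, num_classes, head_ch):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, head_ch, 1)
        self.cls_conv = nn.Sequential(nn.Conv2d(head_ch, head_ch, 3, padding=1),
                                      nn.SiLU(inplace=True))
        self.reg_conv = nn.Sequential(nn.Conv2d(head_ch, head_ch, 3, padding=1),
                                      nn.SiLU(inplace=True))
        self.cls_pred = nn.Conv2d(head_ch, num_classes, 1)   # per-class scores
        self.reg_pred = nn.Conv2d(head_ch, 4, 1)             # distances to the 4 box sides

    def forward(self, x):
        x = self.stem(x)
        return self.cls_pred(self.cls_conv(x)), self.reg_pred(self.reg_conv(x))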
There are two types of anchor-free detectors:
Anchor point-based [7, 41]
Keypoint-based [16, 46, 53].
In YOLOv6, we adopt the anchor point-based paradigm, whose box regression branch actually predicts the distance from the anchor point to the four sides of the bounding boxes.
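As an illustration, decoding these distance predictions back to boxes can look like the sketch below, assuming the predictions are (left, top, right, bottom) distances in feature-map units and the anchor points are grid-cell centers in image coordinates.

import torch

def decode_ltrb(anchor_points, pred_dist, stride):
    # anchor_points: (N, 2) grid-cell centers in image coordinates
    # pred_dist:     (N, 4) predicted distances (left, top, right, bottom)
    # returns boxes as (x1, y1, x2, y2) in image coordinates
    d = pred_dist * stride
    x1y1 = anchor_points - d[:, :2]
    x2y2 = anchor_points + d[:, 2:]
    return torch.cat([x1y1, x2y2], dim=-1)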
Label assignment is responsible for assigning labels to predefined anchors during the training stage.
SimOTA slows down the training process, and it is not rare for it to fall into unstable training.
Task Alignment Learning (TAL): First proposed in TOOD [5], in which a unified metric of classification score and predicted box quality is designed. The IoU is replaced by this metric to assign object labels. To a certain extent, the problem of the misalignment of tasks (classification and box regression) is alleviated.
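The alignment metric at the core of TAL is commonly written as t = s^α · u^β, where s is the classification score and u is the IoU between the predicted and ground-truth boxes; labels are assigned to the top-k candidates per ground truth under this metric. A minimal sketch under these assumptions (the full assigner also restricts candidates to anchors inside the ground-truth box and resolves anchors claimed by multiple ground truths):

import torch

def tal_alignment(cls_scores, ious, alpha=1.0, beta=6.0, topk=13):
    # cls_scores: (num_gt, num_anchors) predicted score of each anchor for its GT's class
    # ious:       (num_gt, num_anchors) IoU between predicted boxes and each GT box
    metric = cls_scores.pow(alpha) * ious.pow(beta)        # task-alignment metric t
    topk_idx = metric.topk(topk, dim=-1).indices           # candidate positives per GT
    mask = torch.zeros_like(metric, dtype=torch.bool)
    mask.scatter_(1, topk_idx, True)
    return metric, mask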
Classification Loss:
VariFocal Loss [50] treats the positive and negative samples asymmetrically.
By considering positive and negative samples at different degrees of importance, it balances the learning signals from both; a minimal sketch follows below.
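This sketch follows the published VariFocal formulation (positives are weighted by the IoU-aware soft label q, negatives are down-weighted by α·p^γ); the default α and γ values below are the ones reported for VFL and may differ from the YOLOv6 settings.

import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_score, alpha=0.75, gamma=2.0):
    # pred_logits:  raw class logits
    # target_score: IoU-aware soft label q (0 for negatives, IoU-like quality for positives)
    p = pred_logits.sigmoid()
    # positives keep full BCE weighted by q; negatives are down-weighted by alpha * p^gamma
    weight = torch.where(target_score > 0, target_score, alpha * p.pow(gamma))
    loss = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction="none")
    return (weight * loss).sum()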
Box Loss:
SIoU is applied to YOLOv6-N and YOLOv6-T.
Others use GIoU.
Adopt DFL [20] in YOLOv6-M/L (a minimal sketch of DFL follows below).
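DFL models each box side as a discrete distribution over integer bins and supervises the two bins adjacent to the continuous target. The sketch below follows that standard formulation; the bin count reg_max is an assumed hyperparameter.

import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_dist, target, reg_max=16):
    # pred_dist: (N, reg_max + 1) logits over discretized distances for one box side
    # target:    (N,) continuous distance targets in [0, reg_max]
    tl = target.long()                      # lower adjacent bin
    tr = (tl + 1).clamp(max=reg_max)        # upper adjacent bin
    wl = tr.float() - target                # weight toward the lower bin
    wr = 1.0 - wl                           # weight toward the upper bin
    loss = (F.cross_entropy(pred_dist, tl, reduction="none") * wl
            + F.cross_entropy(pred_dist, tr, reduction="none") * wr)
    return loss.mean()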
Object Loss:
Being an anchor-free framework like FCOS and YOLOX, we tried adding an object loss to YOLOv6.
Unfortunately, it doesn’t bring many positive effects.
More training epochs. Extend the training duration from 300 epochs to 400 epochs to reach a better convergence.
Self-distillation: both classification and regression are respectively supervised by the teacher model.
Based on DFL [20], distillation of box regression is made.
The proportion of information from the soft and hard labels is dynamically declined via cosine decay, which helps the student selectively acquire knowledge at different phases during the training process.
The knowledge distillation loss can then be formulated as:
L_KD = KL(p_t^cls || p_s^cls) + KL(p_t^reg || p_s^reg),
where p_t^cls and p_s^cls are the class predictions of the teacher model and the student model, and p_t^reg and p_s^reg are their box regression predictions.
The overall loss function is now formulated as:
L_total = L_det + α · L_KD,
where Ldet is the detection loss computed with predictions and labels.
The hyperparameter α is introduced to balance two losses.
In the early stage of training, the soft labels from the teacher are easier to learn. As the training continues, the performance of the student will match the teacher so that the hard labels will help students more.
Upon this, we apply cosine weight decay to α to dynamically adjust the information from hard labels and soft ones from the teacher.
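A minimal sketch of such a cosine schedule for the distillation weight; the start and end values are illustrative assumptions, not the paper's exact settings.

import math

def kd_weight(epoch, max_epochs, alpha_start=1.0, alpha_end=0.0):
    # Cosine decay of the distillation weight: soft labels dominate early in training,
    # hard labels dominate toward the end.
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / max_epochs))
    return alpha_end + (alpha_start - alpha_end) * cos

# per-iteration total loss (l_det and l_kd assumed computed elsewhere):
# loss = l_det + kd_weight(epoch, max_epochs) * l_kd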
Gray border of images:
We notice that a half-stride gray border is put around each image when evaluating the model performance in the implementations of YOLOv5 [10] and YOLOv7 [42]. Although no useful information is added, it helps in detecting the objects near the edge of the image. This trick also applies in YOLOv6.
However, the extra gray pixels evidently reduce the inference speed. Without the gray border, the performance of YOLOv6 deteriorates, which is also the case in [10, 42]. We postulate that the problem is related to the gray borders padding in Mosaic augmentation [1, 10]. Experiments on turning mosaic augmentations off during last epochs [7] (aka. fade strategy) are conducted for verification. In this regard, we change the area of gray border and resize the image with gray borders directly to the target image size. Combining these two strategies, our models can maintain or even boost the performance without the degradation of inference speed.
This counters the problem of impaired performance when no extra gray border is added at evaluation.
For industrial deployment, it has been common practice to adopt quantization to further speed up runtime without much performance compromise. Post-training quantization (PTQ) directly quantizes the model with only a small calibration set, whereas quantization-aware training (QAT) further improves the performance with access to the training set and is typically used jointly with distillation. However, due to the heavy use of re-parameterization blocks in YOLOv6, previous PTQ techniques fail to produce high performance, and it is hard to incorporate QAT when it comes to matching fake quantizers during training and inference. We here demonstrate the pitfalls and our cures during deployment.
Reparameterizing Optimizer
Train YOLOv6 with RepOptimizer [2] to obtain PTQ-friendly weights.
RepOptimizer [2] proposes gradient re-parameterization at each optimization step, which also solves the quantization problem of reparameterization-based models well. We hence reconstruct the re-parameterization blocks of YOLOv6 in this fashion and train it with RepOptimizer to obtain PTQ-friendly weights. The distribution of feature maps is largely narrowed (e.g. Fig. 4, more in B.1), which greatly benefits the quantization process; see Sec. 3.5.1 for results.
Sensitivity Analysis
We further improve the PTQ performance by partially converting quantization-sensitive operations into float computation. To obtain the sensitivity distribution, several metrics are commonly used, mean-square error (MSE), signal-noise ratio (SNR) and cosine similarity. Typically for comparison, one can pick the output feature map (after the activation of a certain layer) to calculate these metrics with and without quantization. As an alternative, it is also viable to compute validation AP by switching quantization on and off for the certain layer [29]. We compute all these metrics on the YOLOv6-S model trained with RepOptimizer and pick the top-6 sensitive layers to run in float. The full chart of sensitivity analysis can be found in B.2.
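A sketch of how these per-layer metrics can be computed, assuming one can capture the same layer's output feature map with quantization switched on and off; the capture mechanism and the final layer ranking are outside this snippet.

import torch

def sensitivity_metrics(fp_feat, quant_feat):
    # Compare a layer's float feature map with its quantized counterpart.
    fp = fp_feat.flatten().float()
    q = quant_feat.flatten().float()
    noise = fp - q
    mse = noise.pow(2).mean()
    snr = 10.0 * torch.log10(fp.pow(2).mean() / (noise.pow(2).mean() + 1e-12))
    cosine = torch.nn.functional.cosine_similarity(fp, q, dim=0)
    return {"mse": mse.item(), "snr_db": snr.item(), "cosine": cosine.item()}

The layers with the worst scores under these metrics (the top-6 most sensitive ones here) are then kept in float.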
Quantization-aware Training with Channel-wise Distillation
Adopt QAT with channel-wise distillation [36] and graph optimization to pursue extreme performance.
When PTQ is insufficient, we propose to involve quantization-aware training (QAT) to boost quantization performance. To resolve the inconsistency of fake quantizers during training and inference, it is necessary to build QAT upon the RepOptimizer. Besides, channel-wise distillation [36] (later as CW Distill) is adapted within the YOLOv6 framework, shown in Fig. 5. This is also a self-distillation approach where the teacher network is the student itself in FP32 precision. See experiments in Sec. 3.5.1.
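A sketch of the channel-wise distillation loss as it is commonly implemented (per-channel spatial softmax with a temperature, then KL divergence); in the QAT setting here the teacher features come from the FP32 model and the student features from the quantized one. The temperature value is an assumption.

import torch
import torch.nn.functional as F

def channel_wise_distillation(student_feat, teacher_feat, tau=1.0):
    # feats: (B, C, H, W). Normalize each channel over its spatial locations,
    # then match the student's per-channel distributions to the teacher's.
    b, c, h, w = student_feat.shape
    s = student_feat.view(b * c, h * w)
    t = teacher_feat.view(b * c, h * w)
    log_p_s = F.log_softmax(s / tau, dim=1)
    p_t = F.softmax(t / tau, dim=1)
    return (tau ** 2) * F.kl_div(log_p_s, p_t, reduction="batchmean")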
Use the same optimizer and the learning schedule as YOLOv5 [10]
Stochastic gradient descent (SGD) with momentum and cosine decay on learning rate.
Warm-up, grouped weight decay strategy and the exponential moving average (EMA).
A complete list of hyperparameter settings can be found in the released code; a minimal sketch of this training setup is shown below.
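A minimal sketch of this setup in PyTorch with illustrative values; warm-up and grouped weight decay are omitted, and the model is assumed to return its total loss directly.

import torch

def train(model, train_loader, max_epochs=400, lr=0.01, momentum=0.937, wd=5e-4):
    # SGD with momentum, cosine learning-rate decay and an EMA of the weights.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum,
                                weight_decay=wd, nesterov=True)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)
    ema = torch.optim.swa_utils.AveragedModel(
        model, avg_fn=lambda e, p, n: 0.9999 * e + (1 - 0.9999) * p)
    for epoch in range(max_epochs):
        for images, targets in train_loader:
            loss = model(images, targets)      # assumes the model returns the total loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ema.update_parameters(model)
        scheduler.step()
    return ema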
We train our models on the COCO 2017 [23] training set, and the accuracy is evaluated on the COCO 2017 validation set.
Evaluation metrics: mAP, latency, FLOPs.
Compared with YOLOv5-N/YOLOv7-Tiny (input size=416), our YOLOv6-N has significantly advanced by 7.9%/2.6% respectively. It also comes with the best speed performance in terms of both throughput and latency. Compared with YOLOX-S/PPYOLOE-S, YOLOv6-S can improve AP by 3.0%/0.4% with higher speed. We compare YOLOv5-S and YOLOv7-Tiny (input size=640) with YOLOv6-T, our method is 2.9% more accurate and 73/25 FPS faster with a batch size of 1. YOLOv6-M outperforms YOLOv5-M by 4.2% higher AP with a similar speed, and it achieves 2.7%/0.6% higher AP than YOLOX-M/PPYOLOE-M at a higher speed. Besides, it is more accurate and faster than YOLOv5-L. YOLOv6-L is 2.8%/1.1% more accurate than YOLOX-L/PPYOLOE-L under the same latency constraint. We additionally provide a faster version of YOLOv6-L by replacing SiLU with ReLU (denoted as YOLOv6-L-ReLU). It achieves 51.7% AP with a latency of 8.8 ms, outperforming YOLOX-L/PPYOLOE-L/YOLOv7 in both accuracy and speed.
YOLOv6 compared to YOLOv5 and YOLOX (small models)
https://vinbigdata.com/kham-pha/yolo-v7-thuat-toan-phat-hien-doi-tuong-co-gi-moi.html
YOLOX: Exceeding YOLO Series in 2021, https://arxiv.org/abs/2107.08430
PP-YOLOE: An evolved version of YOLO, https://arxiv.org/abs/2203.16250
RepVGG: Making VGG-style ConvNets Great Again, https://arxiv.org/pdf/2101.03697
CSPNet: A New Backbone that can Enhance Learning Capability of CNN, https://arxiv.org/abs/1911.11929
Path aggregation network for instance segmentation, https://arxiv.org/abs/1803.01534
OTA: Optimal Transport Assignment for Object Detection, https://arxiv.org/abs/2103.14259
Computer Architecture: A Quantitative Approach
SIoU Loss: More Powerful Learning for Bounding Box Regression, https://arxiv.org/abs/2205.12740