YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications
Meituan Inc.
For years, the YOLO series has been the de facto industry-level standard for efficient object detection.
The YOLO community has prospered overwhelmingly to enrich its use in a multitude of hardware platforms and abundant scenarios.
Assimilate ideas from recent network design, training strategies, testing techniques, quantization and optimization methods.
Refashion a line of networks of different sizes tailored for industrial applications in diverse scenarios.
Imbue YOLOv6 with a self-distillation strategy, performed both on the classification task and the regression task.
Verify advanced detection techniques for label assignment, loss functions and data augmentation.
Reform the quantization scheme for detection with the help of RepOptimizer [2] and channel-wise distillation [36].
Several important factors:
Reparameterization from RepVGG [3].
Quantization of reparameterization-based detectors.
Latencies in deployment.
Label assignment and loss function.
Training strategy, such as knowledge distillation.
Figure. Fusion process of Rep operator.
The RepVGG Style structure is a reparameterizable structure that has a multi-branch topology during training and can be equivalently fused into a single 3x3 convolution during actual deployment.
The fused 3x3 convolution structure makes effective use of computationally intensive hardware (such as GPUs) and also benefits from the highly optimized NVIDIA cuDNN and Intel MKL libraries on GPU/CPU.
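To make the fusion concrete, here is a minimal PyTorch sketch of folding a RepVGG-style block (3x3 branch, 1x1 branch and identity branch, each followed by BatchNorm, stride 1, equal input/output channels, no convolution bias) into one 3x3 convolution. Function names such as fuse_conv_bn and fuse_repvgg_block are illustrative, not the official implementation.

import torch
import torch.nn as nn

def fuse_conv_bn(conv_weight, bn):
    # Fold BatchNorm statistics into the preceding convolution's weight and bias.
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    weight = conv_weight * scale.reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    return weight, bias

def fuse_repvgg_block(conv3x3, bn3x3, conv1x1, bn1x1, bn_id, channels):
    # 3x3 branch
    w3, b3 = fuse_conv_bn(conv3x3.weight, bn3x3)
    # 1x1 branch: zero-pad the kernel to 3x3 so it can be summed with the 3x3 branch
    w1, b1 = fuse_conv_bn(conv1x1.weight, bn1x1)
    w1 = nn.functional.pad(w1, [1, 1, 1, 1])
    # identity branch: build an identity 3x3 kernel, then fold its BatchNorm
    w_id = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    w_id, b_id = fuse_conv_bn(w_id, bn_id)
    # the deployed single 3x3 convolution
    fused = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
    fused.weight.data = w3 + w1 + w_id
    fused.bias.data = b3 + b1 + b_id
    return fused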
The renovated design of YOLOv6 consists of the following components:
Network design.
Label assignment.
Loss function.
Data augmentation.
Industry-handy improvements.
Quantization and deployment.
Based on EfficientNet-L2
Data augmentation: Mosaic [1,10] and Mixup [49], following [1,7,10].
Problem:
Multi-branch networks achieve better classification performance => reduced parallelism => increased inference latency.
Single-path networks (VGG) take advantage of high parallelism and a smaller memory footprint => higher inference efficiency.
RepVGG => hard to scale to larger models.
Propose EfficientRep backbone:
For small networks, take RepBlock as the building blocks during the training phase. Each RepBlock is converted to stacks of 3 × 3 convolutional layers (denoted as RepConv) with ReLU activation functions during the inference phase.
For medium and large models, we use the CSPStackRep Block. Each is composed of three 1×1 convolution layers and a stack of sub-blocks, each consisting of two RepVGG blocks [3] (or RepConvs at inference) with a residual connection. Besides, a cross stage partial (CSP) connection is adopted to boost performance without excessive computation cost. A minimal sketch of both block types follows below.
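The sketch below illustrates the training-time topology of the two block types in simplified PyTorch; it omits stride handling, the exact sub-block layout and the switch-to-deploy conversion, so treat the class layout as an assumption rather than the released code.

import torch
import torch.nn as nn

class RepVGGBlock(nn.Module):
    # Simplified multi-branch unit (3x3 + 1x1 + identity); stride/deploy handling omitted.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(out_ch))
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                   nn.BatchNorm2d(out_ch))
        self.identity = nn.BatchNorm2d(out_ch) if in_ch == out_ch else None
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv3(x) + self.conv1(x)
        if self.identity is not None:
            out = out + self.identity(x)
        return self.act(out)

class RepBlock(nn.Module):
    # Stack of RepVGG-style units; each fuses to a single 3x3 RepConv at inference time.
    def __init__(self, in_ch, out_ch, n=1):
        super().__init__()
        self.conv1 = RepVGGBlock(in_ch, out_ch)   # first unit aligns the channel dimension
        self.body = nn.Sequential(*(RepVGGBlock(out_ch, out_ch) for _ in range(n - 1)))

    def forward(self, x):
        return self.body(self.conv1(x))

class CSPStackRep(nn.Module):
    # Three 1x1 convolutions plus a stack of RepVGG sub-blocks, wired as a CSP structure
    # (the sub-block stack is simplified to a RepBlock here).
    def __init__(self, in_ch, out_ch, n=1, e=0.5):
        super().__init__()
        hid = int(out_ch * e)
        self.cv1 = nn.Conv2d(in_ch, hid, 1)
        self.cv2 = nn.Conv2d(in_ch, hid, 1)
        self.cv3 = nn.Conv2d(2 * hid, out_ch, 1)
        self.m = RepBlock(hid, hid, n=n)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))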
Figure. EfficientRep Backbone structure diagram
We replaced the ordinary Conv layer with stride=2 in Backbone with the RepConv layer with stride=2.
At the same time, the original CSP-Block is redesigned into RepBlock, where the first RepConv of RepBlock will transform and align the channel dimension.
In addition, we also redesigned the original SPPF into a more efficient SimSPPF.
Adopts PAN topology.
Enhance Neck with RepBlocks or CSPStackRep Blocks.
Propose Rep-PAN: Replace the CSPBlock used in YOLOv5 with RepBlock (for small models) or CSPStackRep Block (for large models), and adjust the width and depth accordingly.
Figure. Rep-PAN structure diagram
Rep-PAN is based on the PAN [6] topology and replaces the CSP-Block used in YOLOv5 with RepBlock, while the operators in the overall Neck are also adjusted.
The purpose is to achieve efficient inference on the hardware while maintaining good multi-scale feature fusion capability.
YOLOX: Decouples the classification and localization branches and introduces two additional 3x3 convolutional layers in each branch.
Propose Efficient Decoupled Head (a minimal sketch follows after this list):
Adopt a hybrid-channel strategy.
Reduce the number of the middle 3x3 convolutional layers to only one.
The width of the head is jointly scaled by the width multiplier for the backbone and the neck.
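A rough sketch of one level of such a decoupled head, assuming a 1x1 stem, a single 3x3 middle convolution per branch and 1x1 prediction layers; the channel numbers and activation choices are illustrative, not the released layout.

import torch.nn as nn

class EfficientDecoupledHead(nn.Module):
    # One detection head level: shared stem, then a single 3x3 conv per branch
    # (instead of two as in YOLOX), followed by 1x1 prediction layers.
    def __init__(self, in_ch, num_classes, head_ch):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, head_ch, 1)
        self.cls_conv = nn.Sequential(nn.Conv2d(head_ch, head_ch, 3, padding=1),
                                      nn.SiLU(inplace=True))
        self.reg_conv = nn.Sequential(nn.Conv2d(head_ch, head_ch, 3, padding=1),
                                      nn.SiLU(inplace=True))
        self.cls_pred = nn.Conv2d(head_ch, num_classes, 1)   # per-class scores
        self.reg_pred = nn.Conv2d(head_ch, 4, 1)             # distances to the 4 box sides

    def forward(self, x):
        x = self.stem(x)
        return self.cls_pred(self.cls_conv(x)), self.reg_pred(self.reg_conv(x))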
There are two types of anchor-free detectors:
Anchor point-based [7, 41]
Keypoint-based [16, 46, 53].
In YOLOv6, we adopt the anchor point-based paradigm, whose box regression branch actually predicts the distance from the anchor point to the four sides of the bounding boxes.
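As an illustration, decoding these distance predictions back to boxes can look like the sketch below, assuming the predictions are (left, top, right, bottom) distances in feature-map units and the anchor points are grid-cell centers in image coordinates.

import torch

def decode_ltrb(anchor_points, pred_dist, stride):
    # anchor_points: (N, 2) grid-cell centers in image coordinates
    # pred_dist:     (N, 4) predicted distances (left, top, right, bottom)
    # returns boxes as (x1, y1, x2, y2) in image coordinates
    d = pred_dist * stride
    x1y1 = anchor_points - d[:, :2]
    x2y2 = anchor_points + d[:, 2:]
    return torch.cat([x1y1, x2y2], dim=-1)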
Label assignment is responsible for assigning labels to predefined anchors during the training stage.
SimOTA slows down the training process, and it is not rare for it to fall into unstable training.
Task Alignment Learning (TAL): First proposed in TOOD [5], in which a unified metric of classification score and predicted box quality is designed. The IoU is replaced by this metric to assign object labels. To a certain extent, the problem of the misalignment of tasks (classification and box regression) is alleviated.
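The alignment metric at the core of TAL is commonly written as t = s^α · u^β, where s is the classification score and u is the IoU between the predicted and ground-truth boxes; labels are assigned to the top-k candidates per ground truth under this metric. A minimal sketch under these assumptions (the full assigner also restricts candidates to anchors inside the ground-truth box and resolves anchors claimed by multiple ground truths):

import torch

def tal_alignment(cls_scores, ious, alpha=1.0, beta=6.0, topk=13):
    # cls_scores: (num_gt, num_anchors) predicted score of each anchor for its GT's class
    # ious:       (num_gt, num_anchors) IoU between predicted boxes and each GT box
    metric = cls_scores.pow(alpha) * ious.pow(beta)        # task-alignment metric t
    topk_idx = metric.topk(topk, dim=-1).indices           # candidate positives per GT
    mask = torch.zeros_like(metric, dtype=torch.bool)
    mask.scatter_(1, topk_idx, True)
    return metric, mask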
Classification Loss:
VariFocal Loss [50] treats the positive and negative samples asymmetrically.
By considering positive and negative samples at different degrees of importance, it balances the learning signals from both; a minimal sketch follows below.
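This sketch follows the published VariFocal formulation (positives are weighted by the IoU-aware soft label q, negatives are down-weighted by α·p^γ); the default α and γ values below are the ones reported for VFL and may differ from the YOLOv6 settings.

import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_score, alpha=0.75, gamma=2.0):
    # pred_logits:  raw class logits
    # target_score: IoU-aware soft label q (0 for negatives, IoU-like quality for positives)
    p = pred_logits.sigmoid()
    # positives keep full BCE weighted by q; negatives are down-weighted by alpha * p^gamma
    weight = torch.where(target_score > 0, target_score, alpha * p.pow(gamma))
    loss = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction="none")
    return (weight * loss).sum()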
Box Loss:
SIoU is applied to YOLOv6-N and YOLOv6-T.
Others use GIoU.
Adopt DFL [20] in YOLOv6-M/L (a minimal sketch of DFL follows below).
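DFL models each box side as a discrete distribution over integer bins and supervises the two bins adjacent to the continuous target. The sketch below follows that standard formulation; the bin count reg_max is an assumed hyperparameter.

import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_dist, target, reg_max=16):
    # pred_dist: (N, reg_max + 1) logits over discretized distances for one box side
    # target:    (N,) continuous distance targets in [0, reg_max]
    tl = target.long()                      # lower adjacent bin
    tr = (tl + 1).clamp(max=reg_max)        # upper adjacent bin
    wl = tr.float() - target                # weight toward the lower bin
    wr = 1.0 - wl                           # weight toward the upper bin
    loss = (F.cross_entropy(pred_dist, tl, reduction="none") * wl
            + F.cross_entropy(pred_dist, tr, reduction="none") * wr)
    return loss.mean()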
Object Loss:
Being an anchor-free framework like FCOS and YOLOX, we tried adding an object loss to YOLOv6.
Unfortunately, it doesn’t bring many positive effects.
More training epochs. Extend the training duration from 300 epochs to 400 epochs to reach a better convergence.
Self-distillation: both classification and regression are respectively supervised by the teacher model.
Based on DFL [20], distillation of box regression is made.
The proportion of information from the soft and hard labels is dynamically declined via cosine decay, which helps the student selectively acquire knowledge at different phases during the training process.
The knowledge distillation loss can then be formulated as:
L_KD = KL(p_t^cls || p_s^cls) + KL(p_t^reg || p_s^reg),
where p_t^cls and p_s^cls are the class predictions of the teacher model and the student model, and p_t^reg and p_s^reg are their box regression predictions.
The overall loss function is now formulated as:
L_total = L_det + α · L_KD,
where Ldet is the detection loss computed with predictions and labels.
The hyperparameter α is introduced to balance two losses.
In the early stage of training, the soft labels from the teacher are easier to learn. As the training continues, the performance of the student will match the teacher so that the hard labels will help students more.
Upon this, we apply cosine weight decay to α to dynamically adjust the information from hard labels and soft ones from the teacher.
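A minimal sketch of such a cosine schedule for the distillation weight; the start and end values are illustrative assumptions, not the paper's exact settings.

import math

def kd_weight(epoch, max_epochs, alpha_start=1.0, alpha_end=0.0):
    # Cosine decay of the distillation weight: soft labels dominate early in training,
    # hard labels dominate toward the end.
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / max_epochs))
    return alpha_end + (alpha_start - alpha_end) * cos

# per-iteration total loss (l_det and l_kd assumed computed elsewhere):
# loss = l_det + kd_weight(epoch, max_epochs) * l_kd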
Gray border of images:
We notice that a half-stride gray border is put around each image when evaluating the model performance in the implementations of YOLOv5 [10] and YOLOv7 [42]. Although no useful information is added, it helps in detecting the objects near the edge of the image. This trick also applies in YOLOv6.
However, the extra gray pixels evidently reduce the inference speed. Without the gray border, the performance of YOLOv6 deteriorates, which is also the case in [10, 42]. We postulate that the problem is related to the gray borders padding in Mosaic augmentation [1, 10]. Experiments on turning mosaic augmentations off during last epochs [7] (aka. fade strategy) are conducted for verification. In this regard, we change the area of gray border and resize the image with gray borders directly to the target image size. Combining these two strategies, our models can maintain or even boost the performance without the degradation of inference speed.
This counters the problem of impaired performance when no extra gray border is added at evaluation.
For industrial deployment, it has been common practice to adopt quantization to further speed up runtime without much performance compromise. Post-training quantization (PTQ) directly quantizes the model with only a small calibration set, whereas quantization-aware training (QAT) further improves the performance with access to the training set and is typically used jointly with distillation. However, due to the heavy use of re-parameterization blocks in YOLOv6, previous PTQ techniques fail to produce high performance, and it is hard to incorporate QAT when it comes to matching fake quantizers during training and inference. We here demonstrate the pitfalls and our cures during deployment.
Reparameterizing Optimizer
Train YOLOv6 with RepOptimizer [2] to obtain PTQ-friendly weights.
RepOptimizer [2] proposes gradient re-parameterization at each optimization step, which also solves the quantization problem of reparameterization-based models well. We hence reconstruct the re-parameterization blocks of YOLOv6 in this fashion and train it with RepOptimizer to obtain PTQ-friendly weights. The distribution of feature maps is largely narrowed (e.g. Fig. 4, more in B.1), which greatly benefits the quantization process; see Sec. 3.5.1 for results.
Sensitivity Analysis
We further improve the PTQ performance by partially converting quantization-sensitive operations into float computation. To obtain the sensitivity distribution, several metrics are commonly used, mean-square error (MSE), signal-noise ratio (SNR) and cosine similarity. Typically for comparison, one can pick the output feature map (after the activation of a certain layer) to calculate these metrics with and without quantization. As an alternative, it is also viable to compute validation AP by switching quantization on and off for the certain layer [29]. We compute all these metrics on the YOLOv6-S model trained with RepOptimizer and pick the top-6 sensitive layers to run in float. The full chart of sensitivity analysis can be found in B.2.
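A sketch of how these per-layer metrics can be computed, assuming one can capture the same layer's output feature map with quantization switched on and off; the capture mechanism and the final layer ranking are outside this snippet.

import torch

def sensitivity_metrics(fp_feat, quant_feat):
    # Compare a layer's float feature map with its quantized counterpart.
    fp = fp_feat.flatten().float()
    q = quant_feat.flatten().float()
    noise = fp - q
    mse = noise.pow(2).mean()
    snr = 10.0 * torch.log10(fp.pow(2).mean() / (noise.pow(2).mean() + 1e-12))
    cosine = torch.nn.functional.cosine_similarity(fp, q, dim=0)
    return {"mse": mse.item(), "snr_db": snr.item(), "cosine": cosine.item()}

The layers with the worst scores under these metrics (the top-6 most sensitive ones here) are then kept in float.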
Quantization-aware Training with Channel-wise Distillation
Adopt QAT with channel-wise distillation [36] and graph optimization to pursue extreme performance.
When PTQ is insufficient, we propose to involve quantization-aware training (QAT) to boost quantization performance. To resolve the inconsistency of fake quantizers during training and inference, it is necessary to build QAT upon the RepOptimizer. Besides, channel-wise distillation [36] (later as CW Distill) is adapted within the YOLOv6 framework, shown in Fig. 5. This is also a self-distillation approach where the teacher network is the student itself in FP32 precision. See experiments in Sec. 3.5.1.
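A sketch of the channel-wise distillation loss as it is commonly implemented (per-channel spatial softmax with a temperature, then KL divergence); in the QAT setting here the teacher features come from the FP32 model and the student features from the quantized one. The temperature value is an assumption.

import torch
import torch.nn.functional as F

def channel_wise_distillation(student_feat, teacher_feat, tau=1.0):
    # feats: (B, C, H, W). Normalize each channel over its spatial locations,
    # then match the student's per-channel distributions to the teacher's.
    b, c, h, w = student_feat.shape
    s = student_feat.view(b * c, h * w)
    t = teacher_feat.view(b * c, h * w)
    log_p_s = F.log_softmax(s / tau, dim=1)
    p_t = F.softmax(t / tau, dim=1)
    return (tau ** 2) * F.kl_div(log_p_s, p_t, reduction="batchmean")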
Use the same optimizer and the learning schedule as YOLOv5 [10]
Stochastic gradient descent (SGD) with momentum and cosine decay on learning rate.
Warm-up, grouped weight decay strategy and the exponential moving average (EMA).
A complete list of hyperparameter settings can be found in the released code; a minimal sketch of this training setup is shown below.
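A minimal sketch of this setup in PyTorch with illustrative values; warm-up and grouped weight decay are omitted, and the model is assumed to return its total loss directly.

import torch

def train(model, train_loader, max_epochs=400, lr=0.01, momentum=0.937, wd=5e-4):
    # SGD with momentum, cosine learning-rate decay and an EMA of the weights.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum,
                                weight_decay=wd, nesterov=True)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)
    ema = torch.optim.swa_utils.AveragedModel(
        model, avg_fn=lambda e, p, n: 0.9999 * e + (1 - 0.9999) * p)
    for epoch in range(max_epochs):
        for images, targets in train_loader:
            loss = model(images, targets)      # assumes the model returns the total loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ema.update_parameters(model)
        scheduler.step()
    return ema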
We train our models on the COCO 2017 [23] training set, and the accuracy is evaluated on the COCO 2017 validation set.
Evaluation metrics: mAP, latency, FLOPs.
Compared with YOLOv5-N/YOLOv7-Tiny (input size=416), our YOLOv6-N has significantly advanced by 7.9%/2.6% respectively. It also comes with the best speed performance in terms of both throughput and latency. Compared with YOLOX-S/PPYOLOE-S, YOLOv6-S can improve AP by 3.0%/0.4% with higher speed. We compare YOLOv5-S and YOLOv7-Tiny (input size=640) with YOLOv6-T, our method is 2.9% more accurate and 73/25 FPS faster with a batch size of 1. YOLOv6-M outperforms YOLOv5-M by 4.2% higher AP with a similar speed, and it achieves 2.7%/0.6% higher AP than YOLOX-M/PPYOLOE-M at a higher speed. Besides, it is more accurate and faster than YOLOv5-L. YOLOv6-L is 2.8%/1.1% more accurate than YOLOX-L/PPYOLOE-L under the same latency constraint. We additionally provide a faster version of YOLOv6-L by replacing SiLU with ReLU (denoted as YOLOv6-L-ReLU). It achieves 51.7% AP with a latency of 8.8 ms, outperforming YOLOX-L/PPYOLOE-L/YOLOv7 in both accuracy and speed.
YOLOv6 compared to YOLOv5 and YOLOX (small models)
https://vinbigdata.com/kham-pha/yolo-v7-thuat-toan-phat-hien-doi-tuong-co-gi-moi.html
YOLOX: Exceeding YOLO Series in 2021, https://arxiv.org/abs/2107.08430
PP-YOLOE: An evolved version of YOLO, https://arxiv.org/abs/2203.16250
RepVGG: Making VGG-style ConvNets Great Again, https://arxiv.org/pdf/2101.03697
CSPNet: A New Backbone that can Enhance Learning Capability of CNN, https://arxiv.org/abs/1911.11929
Path aggregation network for instance segmentation, https://arxiv.org/abs/1803.01534
OTA: Optimal Transport Assignment for Object Detection, https://arxiv.org/abs/2103.14259
Computer Architecture: A Quantitative Approach
SIoU Loss: More Powerful Learning for Bounding Box Regression, https://arxiv.org/abs/2205.12740