YOLOv7: Trainable Bag-of-Freebies sets New State-of-the-Art for Real-Time Object Detectors
WongKinYiu
{E-ELAN, Extend and Compound Scaling, Coarse-to-Fine Lead Head Guided Label Assigner, Trainable Bag-of-Freebies}
YOLO Series.
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS, and has the highest accuracy, 56.8% AP, among all known real-time object detectors with 30 FPS or higher on GPU V100.
YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms:
The transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by 509% in speed and 2% in accuracy.
The convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy.
YOLOv7 outperforms: YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B and many other object detectors in speed and accuracy.
YOLOv7 is trained only on the MS COCO dataset from scratch, without using any other datasets or pre-trained weights.
Producing low-power single-chip solutions and improving inference speed on edge CPUs: MCUNet [48, 49], NanoDet [54].
Improving the inference speed of various GPUs: YOLOX [21], YOLOR [81]
Running on CPU: [54, 88, 83, 84], mostly built on MobileNet [28, 66, 27], ShuffleNet [92, 55], or GhostNet [25].
Developed for GPUs: [81, 21, 97] mostly use ResNet [26], DarkNet [63], DLA [87], CSPNet [80].
Based on YOLO [61, 62, 63], FCOS [76, 77]: [3, 79, 81, 21, 54, 85, 23]
Requirements to become a SOTA real-time object detector:
A faster and stronger network architecture.
A more effective feature integration method [22, 97, 37, 74, 59, 30, 9, 45].
A more accurate detection method [76, 77, 69].
A more robust loss function [96, 64, 6, 56, 95, 57].
A more efficient label assignment method [99, 20, 17, 82, 42].
A more efficient training method.
In addition: self-supervised learning and knowledge distillation.
Model re-parameterization [13, 12, 29]
Dynamic Label Assignment [20, 17, 42]
Model re-parameterization [71, 31, 75, 19, 33, 11, 4, 24, 13, 12, 10, 29, 14, 78]: merge multiple computational modules into one at the inference stage.
Two categories:
module-level ensemble: split a module into multiple identical or different branches during training, then integrate the branches into a fully equivalent module at inference.
model-level ensemble: (1) train multiple models with different training data and then average the weights of the trained models (a sketch follows); (2) perform a weighted average of the weights of a model at different iteration numbers.
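A minimal sketch of strategy (1), averaging the weights of several already-trained models; the function name and setup are illustrative assumptions, not code from the paper:

import copy
import torch

@torch.no_grad()
def average_models(models):
    # model-level ensemble: average the floating-point weights of several
    # trained models into a single model (non-float buffers are kept from
    # the first model)
    avg = copy.deepcopy(models[0])
    state = avg.state_dict()
    for k in state:
        if state[k].dtype.is_floating_point:
            state[k] = torch.stack([m.state_dict()[k].float() for m in models]).mean(dim=0)
    avg.load_state_dict(state)
    return avg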
Model scaling [72, 60, 74, 73, 15, 16, 2, 51]: scale models up and down so that they fit different computing devices.
Scaling factors:
resolution (size of the input image).
width (number of channels).
depth (number of layers).
stage (number of feature pyramids).
Prior Blocks:
VoVNet [39]: a one-shot aggregation (OSA) block that concatenates the outputs of consecutive layers once at the end of the block.
CSPVoVNet [79]: proposed in Scaled-YOLOv4. The gradient path is analyzed so that the weights of different layers learn more diverse features.
ELAN [1]: By controlling the shortest longest gradient path, a deeper network can learn and converge effectively.
Propose: E-ELAN (Extended Efficient Layer Aggregation Networks)
Achieve the ability to continuously enhance the learning ability of the network without destroying the gradient transmission path of the original architecture.
Group Convolution is used to expand the channel and cardinality of the computational block.
The feature map calculated by each computational block will be shuffled into g groups and then concatenated together.
g groups of feature maps are added to perform merge cardinality.
In addition to maintaining the original ELAN design, E-ELAN can also guide different groups of computational blocks to learn more diverse features (a sketch of the merge operations follows).
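A minimal sketch of the shuffle-then-merge operations described above; the grouping parameter g and the standalone functions are illustrative assumptions, not the released YOLOv7 layer definitions:

import torch

def shuffle_concat(features, g):
    # `features`: list of (N, C, H, W) maps from the parallel computational
    # blocks built with group convolution (cardinality g, C divisible by g)
    shuffled = []
    for x in features:
        n, c, h, w = x.shape
        # shuffle channels into g groups (as in ShuffleNet's channel shuffle)
        x = x.view(n, g, c // g, h, w).transpose(1, 2).reshape(n, c, h, w)
        shuffled.append(x)
    return torch.cat(shuffled, dim=1)

def merge_cardinality(x, g):
    # split the concatenated map into g groups and add them element-wise
    n, c, h, w = x.shape
    return x.view(n, g, c // g, h, w).sum(dim=1)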
Overall architecture: E-ELAN backbone with a PAN-style neck.
Lead head (YOLOR) for final output.
Auxiliary head for middle layer output.
Detection (box regression) loss: CIoU loss (a sketch follows this list).
(https://github.com/WongKinYiu/yolov7/blob/main/utils/loss.py#L468-L469)
Objectness loss: BCE loss.
(https://github.com/WongKinYiu/yolov7/blob/main/utils/loss.py#L485)
Classification loss: BCE loss with Focal Loss.
(https://github.com/WongKinYiu/yolov7/blob/main/utils/loss.py#L479)
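A minimal sketch of the CIoU computation (YOLOv7's own version is the bbox_iou function with CIoU=True in utils/general.py; the standalone function below is illustrative):

import math
import torch

def bbox_ciou(box1, box2, eps=1e-7):
    # box1, box2: (N, 4) tensors in (x1, y1, x2, y2) format
    b1x1, b1y1, b1x2, b1y2 = box1.unbind(-1)
    b2x1, b2y1, b2x2, b2y2 = box2.unbind(-1)
    w1, h1 = b1x2 - b1x1, b1y2 - b1y1
    w2, h2 = b2x2 - b2x1, b2y2 - b2y1
    # intersection / union -> IoU
    inter = (torch.min(b1x2, b2x2) - torch.max(b1x1, b2x1)).clamp(0) * \
            (torch.min(b1y2, b2y2) - torch.max(b1y1, b2y1)).clamp(0)
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # squared diagonal of the smallest enclosing box (c²)
    cw = torch.max(b1x2, b2x2) - torch.min(b1x1, b2x1)
    ch = torch.max(b1y2, b2y2) - torch.min(b1y1, b2y1)
    c2 = cw ** 2 + ch ** 2 + eps
    # squared distance between box centers (ρ²)
    rho2 = ((b2x1 + b2x2 - b1x1 - b1x2) ** 2 +
            (b2y1 + b2y2 - b1y1 - b1y2) ** 2) / 4
    # aspect-ratio consistency term v and trade-off weight α
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v  # the box loss is then (1 - CIoU)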
(a) and (b): when depth scaling is performed on concatenation-based models, the output width of a computational block also increases, causing the input width of the subsequent transmission layer to increase.
EfficientNet [72]: width, depth, resolution.
Scaled-YOLOv4 [79]: number of stages.
[15]: analyzes the influence of vanilla convolution and group convolution on the number of parameters and the amount of computation when performing width and depth scaling.
(c): When performing model scaling on concatenation-based models, only the depth of a computational block needs to be scaled; the remaining transmission layers are then scaled in width by the corresponding amount.
The proposed compound scaling method maintains the properties the model had at the initial design and preserves the optimal structure (a minimal sketch follows).
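A minimal sketch of the rule, assuming each stacked layer contributes equally to the block's concatenated output; the function and its arguments are illustrative assumptions, not the paper's exact formulation:

def compound_scale(depth, transition_width, depth_factor):
    # 1) depth scaling: scale only the number of layers stacked inside the
    #    computational block
    new_depth = max(1, round(depth * depth_factor))
    # 2) width scaling: a concatenation-based block's output width changes
    #    with its depth; apply the same change ratio to the transition
    #    layer so the initially designed structure is preserved
    ratio = new_depth / depth
    new_transition_width = int(transition_width * ratio)
    return new_depth, new_transition_width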
Label Assignment (LA) - SimOTA (Simple OTA)
Re-parameterization
Model scaling
Implicit Knowledge
Deep Supervision (Auxiliary head)
See Section 3.1 of the FCOS paper and Section 3.3 of the ATSS paper.
RepVGG [13]: RepConv combines a 3×3 convolution, a 1×1 convolution, and an identity connection in one convolutional layer.
When RepConv is applied directly to ResNet, DenseNet, and other such architectures, its accuracy drops significantly (the identity connection destroys the residual in ResNet and the concatenation in DenseNet, which provide more diversity of gradients for different feature maps).
Proposed RepConvN: RepConv without identity connection is used to design the architecture of planned re-parameterized convolution.
When a convolutional layer with residual or concatenation is replaced by re-parameterized convolution, there should be no identity connection (a fusion sketch follows).
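A minimal sketch of the RepVGG-style branch fusion, assuming BN has already been folded into each branch (see the "Fusing Conv + BN" section below), groups=1, and matching in/out channels for the identity branch; names are illustrative:

import torch
import torch.nn.functional as F

def merge_repconv(w3, b3, w1, b1, with_identity=False):
    # w3: (C_out, C_in, 3, 3) kernel, w1: (C_out, C_in, 1, 1) kernel,
    # b3 / b1: (C_out,) biases of the two branches
    w = w3 + F.pad(w1, [1, 1, 1, 1])  # zero-pad the 1x1 kernel to 3x3, then sum
    b = b3 + b1
    if with_identity:
        # the identity branch is equivalent to a 3x3 kernel with a 1 at the
        # center of the matching input channel (requires C_in == C_out);
        # RepConvN drops this branch so it can sit next to a residual
        w_id = torch.zeros_like(w3)
        for c in range(w3.shape[0]):
            w_id[c, c, 1, 1] = 1.0
        w = w + w_id
    return w, b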
[38]: the head responsible for the final output is called the lead head, and the head used to assist training is called the auxiliary head.
Normal: Only final heads are used for loss estimation.
Model with Auxiliary Head: With deep supervision [70, 98, 67, 47, 82, 65, 86, 50], there are auxiliary heads from the intermediate layers to guide the network.
Independent: the mechanism that considers the quality and distribution of the network's prediction output together with the ground truth, and then assigns soft labels [61, 8, 36, 99, 91, 44, 43, 90, 20, 17, 42], is called a "label assigner".
Lead Head Guided Label Assigner: by letting the shallower auxiliary head directly learn the information the lead head has learned, the lead head can focus on learning the residual information that has not yet been learned.
Coarse-to-Fine Lead Head Guided Label Assigner: Similar to (d), however, two different sets of soft labels are generated, i.e., coarse label and fine label. Fine label is the same as the soft label generated by lead head guided label assigner.
Coarse label is generated by allowing more grids to be treated as positive targets by relaxing the constraints of the positive sample assignment process.
Batch normalization in conv-bn-activation topology: the purpose is to integrate the mean and variance of batch normalization into the bias and weight of the convolutional layer at the inference stage (see the "Fusing Conv + BN" section below).
Implicit knowledge in YOLOR [81] combined with the convolutional feature map by addition or multiplication: implicit knowledge can be simplified to a vector by pre-computing at the inference stage, and this vector can be combined with the bias and weight of the previous or subsequent convolutional layer (a fusion sketch follows the two classes below).
import torch
import torch.nn as nn

class ImplicitA(nn.Module):
    # implicit knowledge added to a feature map (learned per-channel shift)
    def __init__(self, channel, mean=0., std=.02):
        super(ImplicitA, self).__init__()
        self.channel = channel
        self.mean = mean
        self.std = std
        self.implicit = nn.Parameter(torch.zeros(1, channel, 1, 1))
        nn.init.normal_(self.implicit, mean=self.mean, std=self.std)

    def forward(self, x):
        return self.implicit + x

class ImplicitM(nn.Module):
    # implicit knowledge multiplied with a feature map (learned per-channel
    # scale); note mean=1., as in the YOLOv7 repo, so the scale is
    # initialized around the identity
    def __init__(self, channel, mean=1., std=.02):
        super(ImplicitM, self).__init__()
        self.channel = channel
        self.mean = mean
        self.std = std
        self.implicit = nn.Parameter(torch.ones(1, channel, 1, 1))
        nn.init.normal_(self.implicit, mean=self.mean, std=self.std)

    def forward(self, x):
        return self.implicit * x
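A minimal sketch of the pre-computing simplification, assuming a 1×1 convolution with bias as in YOLOv7's detection head (IDetect.fuse() in models/yolo.py performs the equivalent folding); the standalone function is illustrative:

import torch

@torch.no_grad()
def fuse_implicit(conv, ia, im):
    # ImplicitA precedes the conv: conv(x + a) = conv(x) + W·a, so the
    # pre-computed vector W·a folds into the bias (1x1 kernel assumed)
    conv.bias += torch.matmul(conv.weight.reshape(conv.weight.shape[0], -1),
                              ia.implicit.reshape(-1, 1)).squeeze(1)
    # ImplicitM follows the conv: m ⊙ (Wx + b) = (m ⊙ W)x + m ⊙ b, so the
    # vector scales both weight and bias per output channel
    conv.bias *= im.implicit.reshape(-1)
    conv.weight *= im.implicit.reshape(-1, 1, 1, 1)
    return conv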
EMA model: EMA is a technique used in Mean Teacher; here the EMA model is used purely as the final inference model (a sketch follows).
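A minimal sketch of the EMA update (YOLOv7's ModelEMA in utils/torch_utils.py additionally ramps the decay up from 0 over the first updates; the class below is illustrative):

import copy
import torch

class ModelEMASketch:
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()  # shadow model, used only for inference
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        msd = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                # v <- decay * v + (1 - decay) * current weight
                v.mul_(self.decay).add_(msd[k].detach(), alpha=1 - self.decay)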
MS COCO 2017
YOLOv7-tiny, YOLOv7, and YOLOv7-W6 for edge GPU, normal GPU, and cloud GPU, respectively.
For YOLOv7, we do stack scaling on the neck and use the proposed compound scaling method to scale up the depth and width of the entire model, obtaining YOLOv7-X.
As for YOLOv7-W6, we use the newly proposed compound scaling method to obtain YOLOv7-E6 and YOLOv7-D6.
Use the proposed E-ELAN for YOLOv7-E6, thereby obtaining YOLOv7-E6E.
Since YOLOv7-tiny is an edge-GPU-oriented architecture, it uses Leaky ReLU as the activation function.
The other models use SiLU as the activation function.
Figure. Comparison of baseline object detectors
Figure. Comparison of state-of-the-art real-time object detectors
Improve the training process to better detect small objects
Allow for more fine-grained control over object size thresholds
Improve the default anchor sizes to better handle small objects
Fusing Conv + BN => Conv
Given a mini-batch B of m elements, calculate the mean and variance of B:
μ_B = (1/m) ∑_{i=1..m} x_i,  σ_B² = (1/m) ∑_{i=1..m} (x_i − μ_B)²
For a d-dimensional input vector x = (x⁽¹⁾, x⁽²⁾, ..., x⁽ᵈ⁾), the normalized version is:
x̂_i⁽ᵏ⁾ = (x_i⁽ᵏ⁾ − μ_B⁽ᵏ⁾) / √(σ_B²⁽ᵏ⁾ + ε), with k ∈ [1, d] and i ∈ [1, m]
We add two learnable parameters γ⁽ᵏ⁾ and β⁽ᵏ⁾ to the Batch-Norm layer:
y_i⁽ᵏ⁾ = γ⁽ᵏ⁾ · x̂_i⁽ᵏ⁾ + β⁽ᵏ⁾
Applied to a 4-D tensor M of shape (batch, channels, height, width), Batch-Norm is a per-channel affine transform:
BN(M)_{n,c,h,w} = γ_c · (M_{n,c,h,w} − μ_c) / √(σ_c² + ε) + β_c
To combine the Conv and Batch-Norm layers, create a new Conv whose weight and bias absorb the BN statistics:
W_fused = (γ / √(σ² + ε)) · W   (1)
b_fused = γ · (b − μ) / √(σ² + ε) + β   (2)
Note that BN operates on the channel dimension, while the Conv weight tensor in PyTorch has shape (out_channels, in_channels, kH, kW); flatten it to (out_channels, −1) so the per-channel scale in (1) can be applied as a matrix multiplication.
At inference, the fused Conv computes y = W_fused · x + b_fused, which reproduces Conv followed by BN exactly.
import torch
import torch.nn as nn

def fuse_conv_and_bn(conv, bn):
    # Create a new Conv layer that combines the old Conv with the BN layer
    fusedconv = nn.Conv2d(conv.in_channels,
                          conv.out_channels,
                          kernel_size=conv.kernel_size,
                          stride=conv.stride,
                          padding=conv.padding,
                          groups=conv.groups,
                          bias=True).requires_grad_(False).to(conv.weight.device)

    # Prepare filters
    w_conv = conv.weight.clone().view(conv.out_channels, -1)  # weight of the old Conv layer, flattened
    w_bn = torch.diag(bn.weight.div(torch.sqrt(bn.eps + bn.running_var)))  # per-channel scale γ / √(σ² + ε) from formula (1)
    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))  # new weight per formula (1)

    # Prepare spatial bias
    b_conv = torch.zeros(conv.weight.size(0), device=conv.weight.device) if conv.bias is None else conv.bias
    b_bn = bn.bias - bn.weight.mul(bn.running_mean).div(torch.sqrt(bn.running_var + bn.eps))
    fusedconv.bias.copy_(torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn)  # new bias per formula (2)
    return fusedconv
Training YOLOv7 on a custom dataset: https://towardsdatascience.com/yolov7-a-deep-dive-into-the-current-state-of-the-art-for-object-detection-ce3ffedeeaeb#162b
A clean, modular Implementation of YOLOv7: https://github.com/Chris-hughes10/Yolov7-training