YOLOv7: Trainable Bag-of-Freebies sets New State-of-the-Art for Real-Time Object Detectors
WongKinYiu
{E-ELAN, Extend and Compound Scaling, Coarse-to-Fine Lead Head Guided Label Assigner, Trainable Bag-of-Freebies}
YOLO Series.
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS, and has the highest accuracy, 56.8% AP, among all known real-time object detectors with 30 FPS or higher on GPU V100.
YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms:
The transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by 509% in speed and 2% in accuracy.
The convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy.
YOLOv7 outperforms: YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B and many other object detectors in speed and accuracy.
YOLOv7 is trained only on the MS COCO dataset from scratch, without using any other datasets or pre-trained weights.
Producing low-power single-chip solutions and improving inference speed on edge CPUs: MCUNet [48, 49], NanoDet [54].
Improving the inference speed of various GPUs: YOLOX [21], YOLOR [81]
Running on CPU: [54, 88, 83, 84], mostly built on MobileNet [28, 66, 27], ShuffleNet [92, 55], or GhostNet [25].
Developed for GPUs: [81, 21, 97] mostly use ResNet [26], DarkNet [63], DLA [87], CSPNet [80].
Based on YOLO [61, 62, 63], FCOS [76, 77]: [3, 79, 81, 21, 54, 85, 23]
Requirements to become a SOTA real-time object detector:
A faster and stronger network architecture.
A more effective feature integration method [22, 97, 37, 74, 59, 30, 9, 45].
A more accurate detection method [76, 77, 69].
A more robust loss function [96, 64, 6, 56, 95, 57].
A more efficient label assignment method [99, 20, 17, 82, 42].
A more efficient training method.
In addition: self-supervised learning and knowledge distillation.
Model re-parameterization [13, 12, 29]
Dynamic Label Assignment [20, 17, 42]
Model re-parameterization [71, 31, 75, 19, 33, 11, 4, 24, 13, 12, 10, 29, 14, 78]: merge multiple computational modules into one at the inference stage.
Two categories:
module-level ensemble: split a module into multiple identical or different branches during training, then integrate the branches into a fully equivalent module at inference.
model-level ensemble: (1) train multiple models with different training data and then average the weights of the trained models (a sketch follows); (2) perform a weighted average of the weights of a model at different iteration numbers.
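A minimal sketch of strategy (1), averaging the weights of several already-trained models; the function name and setup are illustrative assumptions, not code from the paper:

import copy
import torch

@torch.no_grad()
def average_models(models):
    # model-level ensemble: average the floating-point weights of several
    # trained models into a single model (non-float buffers are kept from
    # the first model)
    avg = copy.deepcopy(models[0])
    state = avg.state_dict()
    for k in state:
        if state[k].dtype.is_floating_point:
            state[k] = torch.stack([m.state_dict()[k].float() for m in models]).mean(dim=0)
    avg.load_state_dict(state)
    return avg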
Model scaling [72, 60, 74, 73, 15, 16, 2, 51]: scale models up and down so that they fit different computing devices.
Scaling factors:
resolution (size of the input image).
width (number of channels).
depth (number of layers).
stage (number of feature pyramids).
Prior Blocks:
VoVNet [39]: a one-shot aggregation (OSA) block that concatenates the outputs of consecutive layers once at the end of the block.
CSPVoVNet [79]: proposed in Scaled-YOLOv4. The gradient path is analyzed so that the weights of different layers learn more diverse features.
ELAN [1]: By controlling the shortest longest gradient path, a deeper network can learn and converge effectively.
Propose: E-ELAN (Extended Efficient Layer Aggregation Networks)
Achieve the ability to continuously enhance the learning ability of the network without destroying the gradient transmission path of the original architecture.
Group Convolution is used to expand the channel and cardinality of the computational block.
The feature map calculated by each computational block will be shuffled into g groups and then concatenated together.
g groups of feature maps are added to perform merge cardinality.
In addition to maintaining the original ELAN design, E-ELAN can also guide different groups of computational blocks to learn more diverse features (a sketch of the merge operations follows).
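A minimal sketch of the shuffle-then-merge operations described above; the grouping parameter g and the standalone functions are illustrative assumptions, not the released YOLOv7 layer definitions:

import torch

def shuffle_concat(features, g):
    # `features`: list of (N, C, H, W) maps from the parallel computational
    # blocks built with group convolution (cardinality g, C divisible by g)
    shuffled = []
    for x in features:
        n, c, h, w = x.shape
        # shuffle channels into g groups (as in ShuffleNet's channel shuffle)
        x = x.view(n, g, c // g, h, w).transpose(1, 2).reshape(n, c, h, w)
        shuffled.append(x)
    return torch.cat(shuffled, dim=1)

def merge_cardinality(x, g):
    # split the concatenated map into g groups and add them element-wise
    n, c, h, w = x.shape
    return x.view(n, g, c // g, h, w).sum(dim=1)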
Overall architecture: E-ELAN backbone with a PAN-style neck.
Lead head (YOLOR) for final output.
Auxiliary head for middle layer output.
Detection (box regression) loss: CIoU loss (a sketch follows this list).
(https://github.com/WongKinYiu/yolov7/blob/main/utils/loss.py#L468-L469)
Objectness loss: BCE loss.
(https://github.com/WongKinYiu/yolov7/blob/main/utils/loss.py#L485)
Classification loss: BCE loss with Focal Loss.
(https://github.com/WongKinYiu/yolov7/blob/main/utils/loss.py#L479)
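A minimal sketch of the CIoU computation (YOLOv7's own version is the bbox_iou function with CIoU=True in utils/general.py; the standalone function below is illustrative):

import math
import torch

def bbox_ciou(box1, box2, eps=1e-7):
    # box1, box2: (N, 4) tensors in (x1, y1, x2, y2) format
    b1x1, b1y1, b1x2, b1y2 = box1.unbind(-1)
    b2x1, b2y1, b2x2, b2y2 = box2.unbind(-1)
    w1, h1 = b1x2 - b1x1, b1y2 - b1y1
    w2, h2 = b2x2 - b2x1, b2y2 - b2y1
    # intersection / union -> IoU
    inter = (torch.min(b1x2, b2x2) - torch.max(b1x1, b2x1)).clamp(0) * \
            (torch.min(b1y2, b2y2) - torch.max(b1y1, b2y1)).clamp(0)
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # squared diagonal of the smallest enclosing box (c²)
    cw = torch.max(b1x2, b2x2) - torch.min(b1x1, b2x1)
    ch = torch.max(b1y2, b2y2) - torch.min(b1y1, b2y1)
    c2 = cw ** 2 + ch ** 2 + eps
    # squared distance between box centers (ρ²)
    rho2 = ((b2x1 + b2x2 - b1x1 - b1x2) ** 2 +
            (b2y1 + b2y2 - b1y1 - b1y2) ** 2) / 4
    # aspect-ratio consistency term v and trade-off weight α
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v  # the box loss is then (1 - CIoU)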
(a) and (b): when depth scaling is performed on concatenation-based models, the output width of a computational block also increases, causing the input width of the subsequent transmission layer to increase.
EfficientNet [72]: width, depth, resolution.
Scaled-YOLOv4 [79]: number of stages.
[15]: analyzes the influence of vanilla convolution and group convolution on the number of parameters and the amount of computation when performing width and depth scaling.
(c): When performing model scaling on concatenation-based models, only the depth of a computational block needs to be scaled; the remaining transmission layers are then scaled in width by the corresponding amount.
The proposed compound scaling method maintains the properties the model had at the initial design and preserves the optimal structure (a minimal sketch follows).
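A minimal sketch of the rule, assuming each stacked layer contributes equally to the block's concatenated output; the function and its arguments are illustrative assumptions, not the paper's exact formulation:

def compound_scale(depth, transition_width, depth_factor):
    # 1) depth scaling: scale only the number of layers stacked inside the
    #    computational block
    new_depth = max(1, round(depth * depth_factor))
    # 2) width scaling: a concatenation-based block's output width changes
    #    with its depth; apply the same change ratio to the transition
    #    layer so the initially designed structure is preserved
    ratio = new_depth / depth
    new_transition_width = int(transition_width * ratio)
    return new_depth, new_transition_width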
Label Assignment (LA) - SimOTA (Simple OTA)
Re-parameterization
Model scaling
Implicit Knowledge
Deep Supervision (Auxiliary head)
See Section 3.1 of the FCOS paper and Section 3.3 of the ATSS paper.
RepVGG [13]: RepConv combines a 3×3 convolution, a 1×1 convolution, and an identity connection in one convolutional layer.
When RepConv is applied directly to ResNet, DenseNet, and other such architectures, its accuracy drops significantly (the identity connection destroys the residual in ResNet and the concatenation in DenseNet, which provide more diversity of gradients for different feature maps).
Proposed RepConvN: RepConv without identity connection is used to design the architecture of planned re-parameterized convolution.
When a convolutional layer with residual or concatenation is replaced by re-parameterized convolution, there should be no identity connection (a fusion sketch follows).
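A minimal sketch of the RepVGG-style branch fusion, assuming BN has already been folded into each branch (see the "Fusing Conv + BN" section below), groups=1, and matching in/out channels for the identity branch; names are illustrative:

import torch
import torch.nn.functional as F

def merge_repconv(w3, b3, w1, b1, with_identity=False):
    # w3: (C_out, C_in, 3, 3) kernel, w1: (C_out, C_in, 1, 1) kernel,
    # b3 / b1: (C_out,) biases of the two branches
    w = w3 + F.pad(w1, [1, 1, 1, 1])  # zero-pad the 1x1 kernel to 3x3, then sum
    b = b3 + b1
    if with_identity:
        # the identity branch is equivalent to a 3x3 kernel with a 1 at the
        # center of the matching input channel (requires C_in == C_out);
        # RepConvN drops this branch so it can sit next to a residual
        w_id = torch.zeros_like(w3)
        for c in range(w3.shape[0]):
            w_id[c, c, 1, 1] = 1.0
        w = w + w_id
    return w, b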
[38]: the head responsible for the final output is called the lead head, and the head used to assist training is called the auxiliary head.
Normal: Only final heads are used for loss estimation.
Model with Auxiliary Head: With deep supervision [70, 98, 67, 47, 82, 65, 86, 50], there are auxiliary heads from the intermediate layers to guide the network.
Independent: the mechanism that considers the quality and distribution of the network's prediction output together with the ground truth, and then assigns soft labels [61, 8, 36, 99, 91, 44, 43, 90, 20, 17, 42], is called a "label assigner".
Lead Head Guided Label Assigner: by letting the shallower auxiliary head directly learn the information the lead head has learned, the lead head can focus on learning the residual information that has not yet been learned.
Coarse-to-Fine Lead Head Guided Label Assigner: Similar to (d), however, two different sets of soft labels are generated, i.e., coarse label and fine label. Fine label is the same as the soft label generated by lead head guided label assigner.
Coarse label is generated by allowing more grids to be treated as positive targets by relaxing the constraints of the positive sample assignment process.
Batch normalization in conv-bn-activation topology: the purpose is to integrate the mean and variance of batch normalization into the bias and weight of the convolutional layer at the inference stage (see the "Fusing Conv + BN" section below).
Implicit knowledge in YOLOR [81] combined with the convolutional feature map by addition or multiplication: implicit knowledge can be simplified to a vector by pre-computing at the inference stage, and this vector can be combined with the bias and weight of the previous or subsequent convolutional layer (a fusion sketch follows the two classes below).
import torch
import torch.nn as nn

class ImplicitA(nn.Module):
    # implicit knowledge added to a feature map (learned per-channel shift)
    def __init__(self, channel, mean=0., std=.02):
        super(ImplicitA, self).__init__()
        self.channel = channel
        self.mean = mean
        self.std = std
        self.implicit = nn.Parameter(torch.zeros(1, channel, 1, 1))
        nn.init.normal_(self.implicit, mean=self.mean, std=self.std)

    def forward(self, x):
        return self.implicit + x

class ImplicitM(nn.Module):
    # implicit knowledge multiplied with a feature map (learned per-channel
    # scale); note mean=1., as in the YOLOv7 repo, so the scale is
    # initialized around the identity
    def __init__(self, channel, mean=1., std=.02):
        super(ImplicitM, self).__init__()
        self.channel = channel
        self.mean = mean
        self.std = std
        self.implicit = nn.Parameter(torch.ones(1, channel, 1, 1))
        nn.init.normal_(self.implicit, mean=self.mean, std=self.std)

    def forward(self, x):
        return self.implicit * x
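A minimal sketch of the pre-computing simplification, assuming a 1×1 convolution with bias as in YOLOv7's detection head (IDetect.fuse() in models/yolo.py performs the equivalent folding); the standalone function is illustrative:

import torch

@torch.no_grad()
def fuse_implicit(conv, ia, im):
    # ImplicitA precedes the conv: conv(x + a) = conv(x) + W·a, so the
    # pre-computed vector W·a folds into the bias (1x1 kernel assumed)
    conv.bias += torch.matmul(conv.weight.reshape(conv.weight.shape[0], -1),
                              ia.implicit.reshape(-1, 1)).squeeze(1)
    # ImplicitM follows the conv: m ⊙ (Wx + b) = (m ⊙ W)x + m ⊙ b, so the
    # vector scales both weight and bias per output channel
    conv.bias *= im.implicit.reshape(-1)
    conv.weight *= im.implicit.reshape(-1, 1, 1, 1)
    return conv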
EMA model: EMA is a technique used in Mean Teacher; here the EMA model is used purely as the final inference model (a sketch follows).
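A minimal sketch of the EMA update (YOLOv7's ModelEMA in utils/torch_utils.py additionally ramps the decay up from 0 over the first updates; the class below is illustrative):

import copy
import torch

class ModelEMASketch:
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()  # shadow model, used only for inference
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        msd = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                # v <- decay * v + (1 - decay) * current weight
                v.mul_(self.decay).add_(msd[k].detach(), alpha=1 - self.decay)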
MS COCO 2017
YOLOv7-tiny, YOLOv7, and YOLOv7-W6 for edge GPU, normal GPU, and cloud GPU, respectively.
For YOLOv7, we do stack scaling on the neck and use the proposed compound scaling method to scale up the depth and width of the entire model, obtaining YOLOv7-X.
As for YOLOv7-W6, we use the newly proposed compound scaling method to obtain YOLOv7-E6 and YOLOv7-D6.
Use the proposed E-ELAN for YOLOv7-E6, thereby obtaining YOLOv7-E6E.
Since YOLOv7-tiny is an edge-GPU-oriented architecture, it uses Leaky ReLU as the activation function.
The other models use SiLU as the activation function.
Figure. Comparison of baseline object detectors
Figure. Comparison of state-of-the-art real-time object detectors
Improve the training process to better detect small objects
Allow for more fine-grained control over object size thresholds
Improve the default anchor sizes to better handle small objects
Fusing Conv + BN => Conv
Given a mini-batch B of m elements, calculate the mean and variance of B:
μ_B = (1/m) ∑_{i=1..m} x_i,  σ_B² = (1/m) ∑_{i=1..m} (x_i − μ_B)²
For a d-dimensional input vector x = (x⁽¹⁾, x⁽²⁾, ..., x⁽ᵈ⁾), the normalized version is:
x̂_i⁽ᵏ⁾ = (x_i⁽ᵏ⁾ − μ_B⁽ᵏ⁾) / √(σ_B²⁽ᵏ⁾ + ε), with k ∈ [1, d] and i ∈ [1, m]
We add two learnable parameters γ⁽ᵏ⁾ and β⁽ᵏ⁾ to the Batch-Norm layer:
y_i⁽ᵏ⁾ = γ⁽ᵏ⁾ · x̂_i⁽ᵏ⁾ + β⁽ᵏ⁾
Applied to a 4-D tensor M of shape (batch, channels, height, width), Batch-Norm is a per-channel affine transform:
BN(M)_{n,c,h,w} = γ_c · (M_{n,c,h,w} − μ_c) / √(σ_c² + ε) + β_c
To combine the Conv and Batch-Norm layers, create a new Conv whose weight and bias absorb the BN statistics:
W_fused = (γ / √(σ² + ε)) · W   (1)
b_fused = γ · (b − μ) / √(σ² + ε) + β   (2)
Note that BN operates on the channel dimension, while the Conv weight tensor in PyTorch has shape (out_channels, in_channels, kH, kW); flatten it to (out_channels, −1) so the per-channel scale in (1) can be applied as a matrix multiplication.
At inference, the fused Conv computes y = W_fused · x + b_fused, which reproduces Conv followed by BN exactly.
import torch
import torch.nn as nn

def fuse_conv_and_bn(conv, bn):
    # Create a new Conv layer that combines the old Conv with the BN layer
    fusedconv = nn.Conv2d(conv.in_channels,
                          conv.out_channels,
                          kernel_size=conv.kernel_size,
                          stride=conv.stride,
                          padding=conv.padding,
                          groups=conv.groups,
                          bias=True).requires_grad_(False).to(conv.weight.device)

    # Prepare filters
    w_conv = conv.weight.clone().view(conv.out_channels, -1)  # weight of the old Conv layer, flattened
    w_bn = torch.diag(bn.weight.div(torch.sqrt(bn.eps + bn.running_var)))  # per-channel scale γ / √(σ² + ε) from formula (1)
    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))  # new weight per formula (1)

    # Prepare spatial bias
    b_conv = torch.zeros(conv.weight.size(0), device=conv.weight.device) if conv.bias is None else conv.bias
    b_bn = bn.bias - bn.weight.mul(bn.running_mean).div(torch.sqrt(bn.running_var + bn.eps))
    fusedconv.bias.copy_(torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn)  # new bias per formula (2)
    return fusedconv
Training YOLOv7 on a custom dataset: https://towardsdatascience.com/yolov7-a-deep-dive-into-the-current-state-of-the-art-for-object-detection-ce3ffedeeaeb#162b
A clean, modular Implementation of YOLOv7: https://github.com/Chris-hughes10/Yolov7-training