[YOLOv5] Glenn, J. Yolov5-6.1— TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference.
Ultralytics
{CBS, C3, SPPF}
YOLO Series.
YOLOv5 (v6.0/6.1) is a powerful object detection algorithm developed by Ultralytics.
(New) CSP-Darknet53 + (New) CSP-PAN + YOLOv3 Head.
Focus layer => 6x6 Conv2d.
SPP => SPPF.
Figure. YOLOv5l
Figure. YOLOv5: Overall Architecture
Copy Paste
Random Affine (Rotation, Scale x0.5 - 1.5, Translation, Shear)
Augment HSV (Hue, Saturation, Value)
Random Horizontal Flip
Mosaic
Copy Paste
Random Affine
Random Horizontal Flip
Augment HSV
MixUp Augmentation
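As an illustration of the HSV augmentation listed above, here is a minimal sketch on a single RGB pixel using only the standard library. The function names are hypothetical, and the default gains (0.015, 0.7, 0.4) are assumed to match the repository's default hyperparameters; treat them as illustrative values rather than a statement of the official configuration.

```python
import colorsys
import random


def sample_hsv_gains(rng, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Draw one multiplicative gain per channel, each in [1 - g, 1 + g]."""
    return tuple(rng.uniform(-1, 1) * g + 1 for g in (h_gain, s_gain, v_gain))


def augment_hsv_pixel(rgb, gains):
    """Scale hue, saturation and value of one RGB pixel (floats in [0, 1])."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    gain_h, gain_s, gain_v = gains
    h = (h * gain_h) % 1.0       # hue wraps around the colour wheel
    s = min(1.0, s * gain_s)     # saturation and value are clipped
    v = min(1.0, v * gain_v)
    return colorsys.hsv_to_rgb(h, s, v)


rng = random.Random(0)
gains = sample_hsv_gains(rng)
augmented = augment_hsv_pixel((0.8, 0.4, 0.2), gains)
```

In a real pipeline the same three gains would be applied to every pixel of an image, so the whole image shifts colour consistently.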
New CSPDarknet53.
Start with 6x6 Conv2D (instead of Focus layer - space-to-depth operation).
Stacking of multiple CBS (Conv + BatchNorm + SiLU) modules and C3 modules.
SPPF module is connected at the end.
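To make the Focus-to-convolution change concrete, the sketch below shows the space-to-depth slicing on a single channel and checks that a 6x6 convolution halves the spatial size in the same way. The helper names are hypothetical, and the stride of 2 and padding of 2 for the 6x6 convolution are an assumption based on the v6.0 release description, not something stated in the text above.

```python
def space_to_depth(img):
    """Focus-style slicing: one H x W channel becomes four (H/2) x (W/2) channels."""
    return [
        [row[dx::2] for row in img[dy::2]]
        for dy, dx in ((0, 0), (1, 0), (0, 1), (1, 1))
    ]


def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# A 4 x 4 channel holding the values 0..15, row by row.
img = [[4 * y + x for x in range(4)] for y in range(4)]
planes = space_to_depth(img)

# Both approaches map a 640-pixel side to 320.
focus_side = 640 // 2
conv_side = conv_out_size(640, kernel=6, stride=2, padding=2)
```

Each output position of the slicing sees a 2x2 input patch, so a single strided convolution with a larger kernel can cover the same pixels without the explicit slicing step.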
Figure. YOLOv5: Model Architecture
Figure. Parameters of YOLOv5 Backbone.
CBS module is used to assist C3 module in feature extraction, while SPPF module enhances the feature expression ability of the backbone.
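SPPF avoids the redundant computation of SPP (as in SPPNet) by max pooling the previously max-pooled features: chaining three 5x5 max pools reproduces the 5x5, 9x9 and 13x13 pooling windows that SPP computes in parallel.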
Figure. Structure of SPPF
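import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Simplified SPPF: three chained 5x5 max pools, concatenated with the input."""
    def __init__(self):
        super().__init__()
        # stride 1 and padding 2 keep the spatial size unchanged
        self.maxpool = nn.MaxPool2d(5, 1, padding=2)

    def forward(self, x):
        o1 = self.maxpool(x)   # 5x5 window
        o2 = self.maxpool(o1)  # covers the same area as a 9x9 window
        o3 = self.maxpool(o2)  # covers the same area as a 13x13 window
        return torch.cat([x, o1, o2, o3], dim=1)  # 4x the input channels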
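The claim that chained 5x5 pooling matches larger pooling windows can be checked numerically. The following is a small standard-library sketch (no PyTorch), with a hypothetical helper `max_pool_same` that clips windows at the border, which behaves like padding with negative infinity.

```python
import random


def max_pool_same(img, k):
    """Stride-1 max pooling over a k x k window; border windows are clipped."""
    h, w = len(img), len(img[0])
    r = k // 2
    return [
        [
            max(
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


rng = random.Random(0)
feature = [[rng.random() for _ in range(16)] for _ in range(16)]

o1 = max_pool_same(feature, 5)
o2 = max_pool_same(o1, 5)
o3 = max_pool_same(o2, 5)

# Two chained 5x5 pools equal one 9x9 pool; three equal one 13x13 pool.
same_as_9 = o2 == max_pool_same(feature, 9)
same_as_13 = o3 == max_pool_same(feature, 13)
```

The chained version reuses each intermediate result, so it needs three small pooling passes instead of separate 5x5, 9x9 and 13x13 passes over the input.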
New CSP-PAN.
Figure. Current Necks. (a) Without Feature Fusion, (b) FPN (+ top-down) and (c) PAN (+ bottom-up).
Figure. New CSP-PAN (Within Dashed Box)
Box decoding is the process of adjusting the center coordinate and size of the preset prior anchor to the center coordinate and size of the final prediction box.
The upper left corner coordinate of the feature map is set to (0, 0).
rx and ry are the unadjusted coordinates of the predicted center point (grid cell coordinates).
gx, gy, gw, gh represent the information of the adjusted prediction box (final output coordinates).
pw and ph are for the information of the prior anchor.
sx and sy represent the offsets calculated by the model.
Match positive samples.
Calculate the width and height ratios between each GT box and each Anchor Template.
Assign the successfully matched Anchor Templates to the corresponding cells:
Because the center point offset range is adjusted from (0, 1) to (-0.5, 1.5), a GT box can be assigned to more anchors.
There are 5 versions of YOLOv5: YOLOv5x, YOLOv5l, YOLOv5m, YOLOv5s, and YOLOv5n.
There are 5 larger versions: YOLOv5x6, YOLOv5l6, YOLOv5m6, YOLOv5s6, and YOLOv5n6.
Location (box regression): CIoU Loss
Objectness Score: BCE Loss
Classification: BCE Loss
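For reference, the CIoU term used for the box loss combines IoU, a normalized center-distance penalty and an aspect-ratio penalty. The sketch below follows the standard CIoU definition for two boxes given as (x1, y1, x2, y2); the function name and the small `eps` stabilizer are illustrative choices, and the loss would be taken as 1 - CIoU.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1

    # Plain IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    iou = inter / (aw * ah + bw * bh - inter + eps)

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) - (bx1 + bx2)) ** 2 / 4 + ((ay1 + ay2) - (by1 + by2)) ** 2 / 4

    # Aspect-ratio consistency term and its weight.
    v = (4 / math.pi ** 2) * (math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v


identical = ciou((0, 0, 1, 1), (0, 0, 1, 1))
disjoint = ciou((0, 0, 1, 1), (2, 2, 3, 3))
```

Unlike plain IoU, the value stays informative for non-overlapping boxes: it becomes negative and grows more negative as the centers move apart.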
The objectness losses of the three prediction layers (P3, P4, P5) are weighted differently.
The balance weights are [4.0, 1.0, 0.4] respectively.
The balancing terms were based on the obj losses seen at the 3 output layers. We simply averaged them over a few epochs of early training and set them to their present values. The smaller output layers have more imbalance than the larger object output layers in COCO, and probably in many other datasets as well, but performance will naturally vary by dataset, so I'm not sure if the balancing helps or hurts. This question is interrelated to the number of output layers, and the anchors used. [Ref]
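A minimal sketch of how the balance weights enter the objectness loss is shown below. It assumes a mean binary cross-entropy per layer and uses hypothetical names; only the weights [4.0, 1.0, 0.4] come from the text.

```python
import math

# Balance weights for the three prediction layers, as quoted above.
BALANCE = {"P3": 4.0, "P4": 1.0, "P5": 0.4}


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def objectness_loss(layers):
    """`layers` maps a layer name to a list of (logit, target) pairs."""
    total = 0.0
    for name, pairs in layers.items():
        mean_bce = sum(bce_with_logits(l, t) for l, t in pairs) / len(pairs)
        total += BALANCE[name] * mean_bce
    return total


example = objectness_loss({
    "P3": [(0.0, 1.0)],
    "P4": [(0.0, 1.0)],
    "P5": [(0.0, 1.0)],
})
```

With identical per-layer losses, the P3 layer contributes ten times as much as P5, which is the rebalancing effect the quoted comment describes.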
Multi-scale training: The input images are randomly rescaled within a range of (0.5~1.5x) of their original size.
AutoAnchor (for training on custom data): Optimize the prior anchor boxes to match the statistical characteristics of the GT boxes in the custom data (k-means + genetic algorithm).
Warmup and Cosine LR scheduler: Adjust the learning rate to enhance model performance.
EMA (Exponential Moving Average): A strategy that uses the average of parameters over past steps to stabilize the training process and reduce generalization error.
"""Model Exponential Moving Average from https://github.com/rwightman/pytorch-image-models
Keep a moving average of everything in the model state_dict (parameters and buffers). This is intended to allow functionality like: https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage. A smoothed version of the weights is necessary for some training schemes to perform well.
This class is sensitive where it is initialized in the sequence of model init, GPU assignment, and distributed training wrappers.
"""
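The idea behind the EMA of weights can be sketched without any framework. The class below is a hypothetical, simplified stand-in that tracks a plain dictionary of floats; the decay of 0.9999 and the ramp constant `tau=2000` are assumed values chosen to resemble common defaults, not quoted from the text.

```python
import math


class SimpleEMA:
    """Keep an exponential moving average of a dict of scalar parameters."""

    def __init__(self, params, decay=0.9999, tau=2000):
        self.shadow = dict(params)  # averaged copy of the parameters
        self.decay = decay
        self.tau = tau
        self.updates = 0

    def update(self, params):
        self.updates += 1
        # Ramp the decay up from ~0 so early updates track the model closely.
        d = self.decay * (1 - math.exp(-self.updates / self.tau))
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * value


ema = SimpleEMA({"w": 0.0})
for step in range(1, 4):
    ema.update({"w": float(step)})  # pretend the optimizer moved w to `step`
```

At evaluation time the averaged values in `shadow` would be used in place of the raw weights.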
Mixed precision: A method to perform operations in half-precision format, reducing memory usage and enhancing computational speed.
Evolve hyper-parameters using Genetic Algorithm: Automatically tune hyperparameters to achieve optimal performance.
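The warmup and cosine learning-rate schedule from the list above can be sketched as a single function. This is a simplified illustration: the linear warmup from zero, the initial rate `lr0=0.01` and the final fraction `lrf=0.01` are assumptions for the example, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, total_steps, warmup_steps, lr0=0.01, lrf=0.01):
    """Learning rate at `step`: linear warmup, then cosine decay to lr0 * lrf."""
    # Cosine factor goes from 1.0 at step 0 to lrf at the final step.
    cosine = ((1 - math.cos(math.pi * step / total_steps)) / 2) * (lrf - 1.0) + 1.0
    target = lr0 * cosine
    if step < warmup_steps:
        return target * step / warmup_steps  # ramp up from zero
    return target


schedule = [lr_at(s, total_steps=1000, warmup_steps=100) for s in range(1001)]
```

The rate rises during the first 100 steps, peaks near `lr0`, and then decays smoothly to `lr0 * lrf` at the end of training.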
MSCOCO.
All YOLOv5 larger models outperform EfficientDet by a large margin.
In YOLOv2 and YOLOv3, the formula for calculating the predicted target information is:
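Written in the notation used earlier in this article, and with $s_w$, $s_h$ assumed to denote the raw width and height outputs of the model (the text does not name them), the YOLOv2/YOLOv3 decoding is usually given as:

```latex
\begin{aligned}
g_x &= \sigma(s_x) + r_x \\
g_y &= \sigma(s_y) + r_y \\
g_w &= p_w \, e^{s_w} \\
g_h &= p_h \, e^{s_h}
\end{aligned}
```

Here $\sigma$ is the sigmoid function, so the center offset stays strictly inside (0, 1).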
In YOLOv5, the formula is:
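Using the same symbols as the definitions above, and again assuming $s_w$, $s_h$ for the raw width and height outputs, the YOLOv5 decoding is commonly written as:

```latex
\begin{aligned}
g_x &= 2\,\sigma(s_x) - 0.5 + r_x \\
g_y &= 2\,\sigma(s_y) - 0.5 + r_y \\
g_w &= p_w \,\bigl(2\,\sigma(s_w)\bigr)^2 \\
g_h &= p_h \,\bigl(2\,\sigma(s_h)\bigr)^2
\end{aligned}
```

Since $\sigma$ lies in (0, 1), the center offset $2\sigma(\cdot) - 0.5$ lies in (-0.5, 1.5) and the size factor $(2\sigma(\cdot))^2$ lies in (0, 4).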
Compare the center point offset before and after scaling.
The center point offset range is adjusted from (0, 1) to (-0.5, 1.5).
Therefore, the offset can easily reach 0 or 1, values that a plain sigmoid only approaches asymptotically.
Comparing the height and width scaling ratio (relative to the anchor) before and after the adjustment shows that the original YOLO/Darknet box equations have a serious flaw.
Width and height are completely unbounded because they are simply out = exp(in). This is dangerous, as it can lead to runaway gradients, instabilities, NaN losses, and ultimately a complete loss of training. In the adjusted form, the scaling ratio is bounded to the range (0, 4). Refer to this issue.
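The difference between the two decodings is easy to see numerically. The sketch below implements both forms for a single prediction; the function names are hypothetical, and `sw`, `sh` stand for the raw width and height outputs, which is an assumed naming.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_v3(sx, sy, sw, sh, rx, ry, pw, ph):
    """YOLOv2/YOLOv3 style: offset in (0, 1), size scaled by an unbounded exp."""
    return (sigmoid(sx) + rx, sigmoid(sy) + ry, pw * math.exp(sw), ph * math.exp(sh))


def decode_v5(sx, sy, sw, sh, rx, ry, pw, ph):
    """YOLOv5 style: offset in (-0.5, 1.5), size factor bounded to (0, 4)."""
    return (
        2 * sigmoid(sx) - 0.5 + rx,
        2 * sigmoid(sy) - 0.5 + ry,
        pw * (2 * sigmoid(sw)) ** 2,
        ph * (2 * sigmoid(sh)) ** 2,
    )


# Same raw outputs, same cell (3, 4) and anchor (10, 20); only sw is large.
old_box = decode_v3(0.0, 0.0, 10.0, 0.0, 3, 4, 10, 20)
new_box = decode_v5(0.0, 0.0, 10.0, 0.0, 3, 4, 10, 20)
```

With a raw width output of 10, the exponential form produces a width in the hundreds of thousands, while the bounded form cannot exceed four times the anchor width.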