YOLOv3
Darknet-53
Multi-scale Detection
Predicting the bounding box's width and height might make sense, but that leads to unstable gradients during training. Instead, most modern object detectors predict log-space transforms or offsets to pre-defined default bounding boxes called anchors.
These transforms are then applied to the anchor boxes to obtain the prediction. YOLO v3 uses three anchors at each scale, which results in the prediction of three bounding boxes per cell.
Anchors are bounding box priors that were calculated on the COCO dataset using k-means clustering. We are going to predict the width and height of the box as offsets from cluster centroids. The box's center coordinates relative to the location of the filter application are predicted using a sigmoid function.
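These priors can be reproduced with a plain k-means that uses 1 − IoU as the distance, so the clustering is not biased toward large boxes. Below is a minimal NumPy sketch of the idea, assuming an array of ground-truth (width, height) pairs; the helper names iou_wh and kmeans_anchors are ours, and the official anchors may have been computed with slightly different details:

```python
import numpy as np

def iou_wh(boxes, centroids):
    # IoU between (w, h) pairs, treating all boxes as sharing one center.
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(box_wh, k=9, iters=100, seed=0):
    # Standard k-means loop with distance = 1 - IoU (so we maximize IoU).
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(box_wh, centroids), axis=1)
        centroids = np.array([
            box_wh[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
            for i in range(k)
        ])
    return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area
```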
The following formulas describe how the network outputs are transformed to obtain the bounding box predictions:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)
Here bx, by, bw, bh are the x, y center coordinates, width, and height of our prediction.
tx, ty, tw, th are the raw values the network outputs.
cx and cy are the top-left coordinates of the grid cell.
pw and ph are anchor dimensions for the box.
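A direct NumPy transcription of these equations for one cell and one anchor; the concrete numbers below are made up purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

tx, ty, tw, th = 0.2, -0.4, 0.8, 0.1  # hypothetical raw network outputs
cx, cy = 6, 6                         # top-left corner of the grid cell
pw, ph = 3.6, 2.5                     # anchor dimensions, in grid units

bx = sigmoid(tx) + cx   # predicted center x
by = sigmoid(ty) + cy   # predicted center y
bw = pw * np.exp(tw)    # predicted width
bh = ph * np.exp(th)    # predicted height
```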
We run our center coordinate predictions through a sigmoid function, which forces the output to be between 0 and 1. YOLO doesn't predict the absolute coordinates of the bounding box's center. Instead, it predicts offsets which are:
Relative to the top-left corner of the grid cell that is predicting the object;
Normalized by the dimensions of the cell on the feature map, which are 1.
For example, consider the case of our above dog image. If the prediction coordinates for the center are (0.4, 0.7), then this means that the center lies at (6.4, 6.7) on the 13 x 13 feature map. (Since the top-left coordinates of the red cell are (6,6)).
But wait, what happens if the predicted x and y coordinates are greater than one, for example (1.2, 0.7)? This would mean that the center lies at (7.2, 6.7), in the cell just to the right of our red cell, or the 8th cell in the 7th row. This breaks the theory behind YOLO: if we postulate that the red cell is responsible for predicting the dog, the center of the dog must lie in the red cell and not in the one beside it. To solve this problem, the output is passed through a sigmoid function, which squashes it into the range 0 to 1, effectively keeping the center in the grid cell that is making the prediction.
The dimensions of the bounding box are predicted by applying a log-space transform to the output and then multiplying by the anchor dimensions.
Here the predictions bw and bh are normalized by the height and width of the image (training labels are chosen this way). So, if the predictions bw and bh for the box containing the dog are (0.3, 0.8), then the actual width and height on the 13 x 13 feature map are (13 ∗ 0.3, 13 ∗ 0.8).
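Both examples from the text, worked out in code (all values are taken straight from the paragraphs above):

```python
# Center offsets predicted for the red cell at (6, 6) on the 13 x 13 map:
sig_tx, sig_ty = 0.4, 0.7              # sigmoid outputs from the example
center = (6 + sig_tx, 6 + sig_ty)      # -> (6.4, 6.7) in grid units

# Width/height normalized by the image dimensions, as described above:
bw_norm, bh_norm = 0.3, 0.8
size = (13 * bw_norm, 13 * bh_norm)    # -> (3.9, 10.4) on the 13 x 13 map
```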
First of all, the objectness score represents the probability that an object is contained inside a bounding box. It should be nearly 1 for the red and the neighboring grid cells, and almost 0 for the cells at the corners.
The objectness score is also passed through a sigmoid, so we can interpret it as a probability.
Class confidences represent the probabilities of the detected object belonging to a particular class (dog, cat, person, car, bicycle, etc.). In older YOLO versions, the softmax activation function was used to calculate the class scores.
In YOLO v3, the authors decided to use a sigmoid instead. The reason is that softmaxing class scores assumes the classes are mutually exclusive: in simple words, if an object belongs to one class, it cannot belong to another. This assumption holds for the COCO dataset, which we will work with first.
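The difference is easy to see numerically: with softmax the class probabilities compete and must sum to 1, while independent sigmoids give each class its own yes/no probability, which also allows multi-label datasets. A small illustration with made-up logits:

```python
import numpy as np

logits = np.array([2.0, 1.0, -1.0])   # hypothetical scores: dog, cat, person

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax, softmax.sum())  # competing probabilities, sum exactly 1
print(sigmoid)                 # independent probabilities in (0, 1)
```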
YOLO v3 makes predictions at three different scales. The detection layers are applied to feature maps of three different sizes, with strides 32, 16, and 8, respectively. This means that, with an input of 416 x 416, we make detections on grids of 13 x 13, 26 x 26, and 52 x 52.
The network downsamples the input image until the first detection layer, where a detection is made using the feature maps of a layer with stride 32. Further layers are upsampled by a factor of 2 and concatenated with the feature maps of earlier layers having identical sizes. Another detection is then made at the layer with stride 16. The same upsampling procedure is repeated, and a final detection is made at the layer with stride 8.
Each cell predicts three bounding boxes using three given anchors at each scale, making the total number of anchors used 9. (The anchors are different for different scales).
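For a 416 x 416 input, this fixes the total number of predicted boxes, which we can count directly:

```python
input_size = 416
strides = (32, 16, 8)
anchors_per_scale = 3

total = 0
for s in strides:
    grid = input_size // s                    # 13, 26, 52
    total += grid * grid * anchors_per_scale  # 507, 2028, 8112
print(total)                                  # 10647 boxes before filtering
```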