Weighted Boxes Fusion: Ensembling Boxes from Different Object Detection Models
Roman Solovyev, Weimin Wang, Tatiana Gabruseva
Object detection is a crucial task in computer vision systems.
When real-time inference is not required, ensembles of models help to achieve better results. In this work, we present Weighted Boxes Fusion, a novel method for combining predictions of object detection models.
Our algorithm utilizes confidence scores of all proposed bounding boxes to construct the averaged boxes.
Related post-processing methods include non-maximum suppression (NMS), soft-NMS, and non-maximum weighted (NMW) averaging.
Both NMS and soft-NMS exclude some boxes, while WBF uses all of them. Thus, it can fix cases where all models predict boxes inaccurately: NMS/soft-NMS will keep only one inaccurate box, whereas WBF will produce a fused box using information from all predicted boxes.
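For intuition, consider a toy example (the numbers below are invented for illustration and are not from the paper): two models detect the same object, but each predicted box is shifted off the true location.

```python
import numpy as np

# Suppose the ground-truth box is [100, 100, 200, 200] (x1, y1, x2, y2).
boxes  = np.array([[ 90.0, 100.0, 190.0, 200.0],   # model 1: shifted left
                   [110.0, 100.0, 210.0, 200.0]])  # model 2: shifted right
scores = np.array([0.9, 0.8])

# NMS would keep only the highest-scoring box (soft-NMS keeps both but lowers
# the second score); either way the top box is still the shifted [90, 100, 190, 200].
nms_box = boxes[scores.argmax()]

# WBF instead takes the confidence-weighted average of both boxes, which lands
# at roughly [99.4, 100, 199.4, 200], much closer to the ground truth.
wbf_box = (scores[:, None] * boxes).sum(axis=0) / scores.sum()
print(nms_box, wbf_box)
```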
The non-maximum weighted (NMW) method proposed in [49, 26] has a similar idea. However, the NMW method does not change the confidence scores; it uses the IoU value to weight the boxes. NMW compares against the box with the highest confidence, while WBF updates a fused box at each step and uses it to check the overlap with the next predicted boxes. Also, NMW does not use information about how many models predicted a given box in a cluster and, therefore, does not produce the best results for an ensemble of models.
We propose a novel Weighted Boxes Fusion (WBF) method for combining predictions of object detection models.
Unlike NMS and soft-NMS methods that simply remove part of the predictions, the proposed WBF method uses confidence scores of all proposed bounding boxes to construct the average boxes.
Suppose we have bounding box predictions for the same image from N different models. Alternatively, we have N predictions from the same model for the original and augmented versions of the same image (e.g., vertically/horizontally reflected).
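As a concrete illustration, the predictions could be organized as per-source lists of boxes and scores; the layout and coordinate convention below are assumptions made for this sketch, not prescribed by the method.

```python
# Predictions for one image from N = 2 sources (two models, or one model run on
# the original image and a horizontally flipped copy).
# Each box is [x1, y1, x2, y2] in normalized coordinates; one score per box.
boxes_per_model = [
    [[0.10, 0.20, 0.40, 0.60], [0.55, 0.30, 0.90, 0.70]],  # source 1: two boxes
    [[0.12, 0.21, 0.42, 0.62]],                             # source 2: one box
]
scores_per_model = [
    [0.92, 0.75],
    [0.81],
]
```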
The WBF algorithm works in the following steps:
1. Each predicted box from each model is added to a single list B. The list is sorted in decreasing order of the confidence scores C.
2. Declare empty lists L and F for boxes clusters and fused boxes, respectively. Each position in the list L can contain a set of boxes (or single box), which form a cluster; each position in F contains only one box, which is the fused box from the corresponding cluster in L.
3. Iterate over the predicted boxes in B and try to find a matching box in the list F. A match is defined as a box with a large overlap with the box in question (IoU > THR). Note: in our experiments, THR = 0.55 was close to the optimal threshold.
4. If the match is not found, add the box from the list B to the end of lists L and F as new entries; proceed to the next box in the list B.
5. If the match is found, add this box to the list L at the position pos corresponding to the matching box in the list F.
6. Recalculate the box coordinates and confidence score in F[pos], using all T boxes accumulated in cluster L[pos], with the following fusion formulas:

C = \frac{1}{T} \sum_{i=1}^{T} C_i,

X_{1,2} = \frac{\sum_{i=1}^{T} C_i \cdot X_{1,2,i}}{\sum_{i=1}^{T} C_i}, \quad Y_{1,2} = \frac{\sum_{i=1}^{T} C_i \cdot Y_{1,2,i}}{\sum_{i=1}^{T} C_i}.

Note: nonlinear weights can be used as well, for instance C^2, \sqrt{C}, etc.
The confidence score of the fused box is set to the average confidence of all boxes that form it. The coordinates of the fused box are confidence-weighted sums of the coordinates of the boxes that form it, so boxes with higher confidence contribute more to the fused box coordinates than boxes with lower confidence.
7. After all boxes in B have been processed, rescale the confidence scores in the list F: multiply each by the number of boxes in its cluster and divide by the number of models N. If the number of boxes in a cluster is low, it means that only a few models predicted it, and the confidence score of such a fused box should therefore be decreased.
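The steps above can be condensed into a short single-class reference sketch. The code below is only an illustration under simplifying assumptions (one class, equal model weights, [x1, y1, x2, y2] boxes), not the authors' released implementation; the helper names `iou`, `fuse`, and `weighted_boxes_fusion` are invented here.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fuse(cluster):
    """Step 6: average the confidences, confidence-weight the coordinates."""
    scores = np.array([s for s, _ in cluster])
    boxes = np.array([b for _, b in cluster], dtype=float)
    fused_score = scores.mean()
    fused_box = (scores[:, None] * boxes).sum(axis=0) / scores.sum()
    return fused_score, fused_box

def weighted_boxes_fusion(boxes_per_model, scores_per_model, iou_thr=0.55):
    """Single-class WBF sketch for N sources (models or augmented passes)."""
    n_models = len(boxes_per_model)

    # Step 1: pool all predictions and sort by decreasing confidence.
    all_preds = []
    for boxes, scores in zip(boxes_per_model, scores_per_model):
        all_preds.extend(zip(scores, boxes))
    all_preds.sort(key=lambda p: p[0], reverse=True)

    clusters, fused = [], []  # the lists L and F from the algorithm
    for score, box in all_preds:
        # Step 3: look for a fused box in F with IoU > iou_thr.
        match = next((i for i, (_, fb) in enumerate(fused)
                      if iou(box, fb) > iou_thr), None)
        if match is None:
            # Step 4: no match found, start a new cluster and fused entry.
            clusters.append([(score, box)])
            fused.append((score, np.asarray(box, dtype=float)))
        else:
            # Steps 5-6: add the box to the cluster and re-fuse.
            clusters[match].append((score, box))
            fused[match] = fuse(clusters[match])

    # Step 7: rescale confidences by the cluster size over the number of models.
    out_boxes, out_scores = [], []
    for cluster, (score, box) in zip(clusters, fused):
        out_scores.append(score * len(cluster) / n_models)
        out_boxes.append(box)
    return np.array(out_boxes), np.array(out_scores)
```

With inputs organized as in the earlier layout example, the two heavily overlapping boxes from the two sources would be merged into one fused box, while the unmatched box would be kept with its confidence reduced by a factor of 1/N.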
Experiments are carried out on the Open Images and COCO datasets; evaluation is based on the intersection over union (IoU), precision, and average precision (AP) metrics.