Bag of Freebies for
Training Object Detection Neural Networks
In this summary:
0. Motivation, Objectives, and Related Works:
+ Motivation:
Surveying training heuristics (strategies and pipelines) for Object Detection tasks.
+ Objectives:
Exploring effective and general approaches (boosting performance without introducing extra computational cost during inference):
The mix-up technique.
Training pipelines: learning rate scheduling, label smoothing, and synchronized BatchNorm.
Incrementally stacking these tricks to train single-stage and multi-stage networks.
Result: improvements of up to 5% absolute precision compared to state-of-the-art baselines.
+ Related Works:
(1) Scattering tricks from Image Classification:
The learning rate warmup heuristic [6] overcomes the negative effect of extremely large mini-batch sizes.
A large number of anchors (up to 30k) effectively contributes to the batch size implicitly.
A gradual warmup heuristic is crucial to YOLOv3 [16].
Label smoothing [22] modifies the hard ground truth labeling in cross entropy loss.
Mixup [24] alleviates adversarial perturbation.
Cosine annealing strategy for learning rate decay [13].
(2) Deep Object Detection Pipelines:
Detectors derive from multi-stage and single-stage pipelines:
In single stage pipelines, predictions are generated by a single convolutional network and therefore preserve the spatial alignments (except that YOLO used Fully Connected layers at the end).
In multiple stage pipelines, e.g. Fast R-CNN [3] and Faster-RCNN [17], final predictions are generated from features which are sampled and pooled in specific regions of interest (RoIs).
RoIs are either propagated by neural networks or deterministic algorithms.
Due to the lack of spatial variation in single stage pipelines, spatial data augmentation is crucial to the performance, as proven in the Single Shot MultiBox Object Detector (SSD) [12].
Due to the lack of cross-pipeline exploration, many training details remain exclusive to one family of detectors.
1. Bag of Freebies
1.1 Visually Coherent Image Mixup [24] for Object Detection
Key idea: regularize the neural network to favor simple linear behavior by mixing up pixels as interpolations between pairs of training images. At the same time, one-hot image labels are mixed with the same ratio (Fig. 2).
In the original mixup algorithm, the blending ratio is drawn from a beta distribution B(0.2, 0.2); with such a distribution, the majority of mixups are barely more than noise.
By applying more complex spatial transforms, we introduce occlusions and spatial signal perturbations that are common in natural image presentations.
Experiments: as the blending ratio increases, the objects in the resulting frames become more vibrant and coherent with natural presentations, similar to the transition frames commonly observed when watching low-FPS movies or surveillance videos (Fig. 2 and Fig. 3). In particular, we use geometry-preserved alignment for image mixup to avoid distorting images at the initial steps. We also choose a beta distribution with α and β both at least 1, which is more visually coherent, instead of following the same practice as in image classification, as depicted in Fig. 4.
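For concreteness, below is a minimal NumPy sketch of this visually coherent mixup (the function and argument names are ours, not from the paper's released code): two images are blended on a shared canvas with geometry-preserved alignment, and all boxes from both sources are kept, with the blending ratio carried as a per-object loss weight.

```python
import numpy as np

def detection_mixup(img1, boxes1, img2, boxes2, alpha=1.5, beta=1.5):
    """Images are (H, W, 3) float arrays in [0, 255]; boxes are (N, 5)
    rows of (xmin, ymin, xmax, ymax, class_id)."""
    lam = np.random.beta(alpha, beta)
    # Geometry-preserved alignment: paste both images at the top-left of
    # a canvas big enough for either, so no object is distorted by resizing.
    h = max(img1.shape[0], img2.shape[0])
    w = max(img1.shape[1], img2.shape[1])
    mix = np.zeros((h, w, 3), dtype=np.float32)
    mix[:img1.shape[0], :img1.shape[1]] += lam * img1
    mix[:img2.shape[0], :img2.shape[1]] += (1.0 - lam) * img2
    # Keep every object from both images; the blending ratio becomes a
    # per-object loss weight appended as a sixth column.
    w1 = np.full((len(boxes1), 1), lam, dtype=np.float32)
    w2 = np.full((len(boxes2), 1), 1.0 - lam, dtype=np.float32)
    boxes = np.vstack([np.hstack([boxes1, w1]), np.hstack([boxes2, w2])])
    return mix, boxes
```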
Verify: we experimentally tested empirical mixup ratio distributions using the YOLOv3 network on the Pascal VOC dataset. Table. 1 shows the actual improvements from adopting detection mixup with ratios sampled from different beta distributions. A beta distribution with α and β both equal to 1.5 is marginally better than 1.0 (equivalent to the uniform distribution) and better than a fixed even mixup. We recognize that for object detection, where mutual object occlusion is common, networks are encouraged to observe unusually crowded patches, either presented naturally or created by adversarial techniques.
To validate the effectiveness of visually coherent mixup, we followed the "Elephant in the room" experiments [18] by sliding an elephant image patch through an indoor room image. We trained two YOLOv3 models on the COCO 2017 dataset with identical settings, except that the mixup model uses our mixup approach. We depict some surprising discoveries in Fig. 5. As we can observe, the vanilla model trained without our mixup approach struggles to detect the "elephant in the room" due to heavy occlusion and lack of context, since it is rare to capture an elephant in a kitchen; indeed, after examining the common training datasets, there is no such training image. In comparison, the model trained with our mixup approach is more robust thanks to randomly generated, visually deceptive training images. In addition, we also notice that the mixup model is more humble: it is less confident and generates lower scores for objects on average. However, this behavior does not affect evaluation results, as shown in the experimental results. We evaluated model performance against a fake video with the elephant sliding through, and the results are listed in Table. 2. The model trained with visually coherent mixup is clearly more robust (94.12 vs. 42.95) at detecting the elephant in an indoor scene, even though such a scene is very rare in natural images, and the mixup model preserves crowded furniture objects under heavy occlusion by the alien elephant patch. We recognize that the mixup model receives more challenges during training and is therefore significantly better than the vanilla model at handling unprecedented scenes and very crowded object groups.
1.2 Classification Head Label Smoothing
For each object, detection networks often compute a probability distribution over all classes with the softmax function:
p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad (1)
where the $z_i$ are the unnormalized logits directly from the last linear layer for classification prediction. For object detection during training, we only modify the classification loss by comparing the output distribution p against the ground truth distribution q with cross-entropy:
L = -\sum_i q_i \log p_i. \qquad (2)
q is often a one-hot distribution, where the correct class has probability one while all other classes have zero. The softmax function, however, can only approach this distribution when $z_i \gg z_j, \forall j \neq i$, but never reach it. This encourages the model to be too confident in its predictions and is prone to over-fitting.
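To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch (ours, for illustration only):

```python
import numpy as np

def softmax_cross_entropy(logits, q):
    # Eq. (1): numerically stable softmax over the raw logits z.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    # Eq. (2): cross-entropy against the target distribution q.
    # The small constant guards against log(0) when p underflows.
    return -(q * np.log(p + 1e-12)).sum()
```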
Label smoothing was proposed by Szegedy et al. [22] as a form of regularization. We smooth the ground truth distribution with
q_i = \begin{cases} 1 - \varepsilon & \text{if } i = y, \\ \varepsilon / (K - 1) & \text{otherwise}, \end{cases} \qquad (3)
where K is the total number of classes and ε is a small constant. This technique reduces the model’s confidence, measured by the difference between the largest and smallest logits.
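A small sketch of Eq. (3) in NumPy (the ε default here is an illustrative value, not one fixed by the paper):

```python
import numpy as np

def smooth_one_hot(y, K, eps=0.01):
    # Eq. (3): the true class y gets 1 - eps; every other class
    # shares the remaining mass equally, eps / (K - 1).
    q = np.full(K, eps / (K - 1))
    q[y] = 1.0 - eps
    return q

# e.g. smooth_one_hot(2, 5) -> [0.0025, 0.0025, 0.99, 0.0025, 0.0025]
```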
In the case of sigmoid outputs in the range 0 to 1.0, as in YOLOv3 [16], label smoothing is even simpler: the upper and lower limits of the target range are corrected as in Eq. 3.
1.3 Data Preprocessing
In the image classification domain, neural networks are usually extremely tolerant to geometric transformations of images. Randomly perturbing the spatial characteristics, e.g. randomly flipping, rotating, and cropping images, is actually encouraged in order to improve generalization accuracy and avoid overfitting. However, for object detection image preprocessing, we need to exercise additional caution, since detection networks are more sensitive to such transformations. We experimentally review the following data augmentation methods:
Random geometry transformations, including random cropping (with constraints), random expansion, random horizontal flipping, and random resizing (with random interpolation).
Random color jittering, including brightness, hue, saturation, and contrast.
In terms of detection network types, there are two pipelines for generating final predictions. The first is the single-stage detector network, where final outputs are generated from every single cell in the feature map, for example SSD [12] and YOLO [16] networks, which generate detection results proportional to the spatial shape of the input image. The second is the multi-stage, proposal-and-sampling based approach, following Fast R-CNN [3], where a certain number of candidates are sampled from a large pool of generated RoIs, and the detection results are then produced by repeatedly cropping the corresponding regions on feature maps; the number of predictions is proportional to the number of samples.
Since sampling-based approaches conduct enormous numbers of cropping operations on feature maps, this substitutes for the operation of randomly cropping input images; therefore, these networks do not require extensive geometric augmentations during the training stage. This is the major difference between one-stage and so-called multi-stage object detection data pipelines. In our Faster-RCNN training, we do not use random cropping techniques during data augmentation.
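As an illustration of the single-stage augmentations listed above, here is a hedged NumPy sketch covering the horizontal flip and a simple brightness/contrast jitter (parameter ranges are our own illustrative choices, not the paper's):

```python
import numpy as np

def augment_one_stage(img, boxes, rng=np.random):
    """Flip + color jitter sketch; img is (H, W, 3) float in [0, 255],
    boxes is (N, 4) rows of (xmin, ymin, xmax, ymax)."""
    h, w = img.shape[:2]
    # Random horizontal flip: mirror pixels and reflect box x-coordinates.
    if rng.rand() < 0.5:
        img = img[:, ::-1]
        boxes = boxes.copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    # Random brightness and contrast jitter (illustrative ranges).
    img = img * rng.uniform(0.8, 1.2) + rng.uniform(-16, 16)
    return np.clip(img, 0, 255), boxes
```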
1.4 Training Schedule Revamping
During training, the learning rate usually starts with a relatively large value and gradually becomes smaller throughout the training process. For example, the step schedule is the most widely used learning rate schedule. With a step schedule, the learning rate is multiplied by a constant number below 1 after reaching pre-defined epochs or iterations. For instance, the default step learning rate schedule for Faster-RCNN [17] reduces the learning rate by a ratio of 0.1 at 60k iterations. Similarly, YOLOv3 [16] uses the same ratio of 0.1 to reduce the learning rate at 40k and 45k iterations. The step schedule has sharp learning rate transitions, which may cause the optimizer to re-stabilize the learning momentum over the next few iterations. In contrast, a smoother cosine learning rate adjustment was proposed by Loshchilov et al. [13]. The cosine schedule scales the learning rate according to the value of the cosine function on [0, π]: it starts by slowly reducing the large learning rate, then reduces the learning rate quickly halfway through, and finally ends with a tiny slope, reducing the small learning rate until it reaches 0. In our implementation, we follow He et al. [8], but the numbers of iterations are adjusted according to the object detection networks and datasets.
Learning rate warmup is another common strategy to avoid gradient explosion during the initial training iterations. The warmup schedule is critical to several object detection algorithms, e.g., YOLOv3, which has a dominant gradient from negative examples in the very first iterations, where the sigmoid classification score is initialized around 0.5 and biased towards 0 for the majority of predictions.
Training with a cosine schedule and proper warmup leads to better validation accuracy. As depicted in Fig. 6, the validation mAP achieved by applying cosine learning rate decay outperforms the step learning rate schedule at all times during training. Due to the higher frequency of learning rate adjustment, it also suffers less from the plateau phenomenon of step decay, where validation performance is stuck for a while until the learning rate is reduced.
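A minimal sketch of linear warmup followed by cosine decay (base_lr and warmup_iters are illustrative values, not the paper's exact settings):

```python
import math

def lr_at(iteration, total_iters, base_lr=0.001, warmup_iters=1000):
    """Linear warmup, then cosine decay from base_lr to 0."""
    if iteration < warmup_iters:
        # Warmup: ramp the learning rate linearly from 0 to base_lr.
        return base_lr * iteration / warmup_iters
    # Cosine decay over the remaining iterations, following cos on [0, pi].
    progress = (iteration - warmup_iters) / (total_iters - warmup_iters)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```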
1.5 Synchronized Batch Normalization
In recent years, the need for massive computation resources has pushed training environments to equip multiple devices (usually GPUs) to accelerate training. Beyond adjusting hyper-parameters in response to larger batch sizes during training, Batch Normalization [10] is drawing the attention of multi-device users due to its implementation details. Although the typical implementation of Batch Normalization across multiple devices (GPUs) is fast (with no communication overhead), it inevitably reduces the effective batch size per device and causes slightly different statistics during computation, which can potentially degrade performance. This is not a significant issue in some standard vision tasks such as ImageNet classification (where the batch size per device is usually large enough to obtain good statistics). However, it hurts performance in tasks with small batch sizes (e.g., 1 per GPU). Recently, Peng et al. [14] proved the importance of synchronized batch normalization in object detection. In this work, we review the importance of Synchronized Batch Normalization with YOLOv3 [16] to evaluate the impact of a relatively smaller batch size on each GPU, since training image shapes are significantly larger than in image classification tasks.
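For illustration only, PyTorch's built-in conversion shows the idea of synchronizing BatchNorm statistics across devices; the paper's own implementation is in MXNet/GluonCV, not PyTorch:

```python
from torch import nn

# Toy model with ordinary BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Replace every BatchNorm layer with a synchronized variant so that
# per-batch mean and variance are aggregated across all GPUs in the
# process group rather than computed per device. This requires a
# distributed process group to be initialized before training, e.g.
# torch.distributed.init_process_group("nccl", ...).
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```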
1.6 Random shapes training for single-stage object detection networks
Natural training images come in various shapes. To fit memory limitations and allow simpler batching, many single-stage object detection networks are trained with fixed shapes [12, 15]. To reduce the risk of overfitting and to improve the generalization of network predictions, we follow the approach of random shapes training as in Redmon et al. [16]. More specifically, a mini-batch of N training images is resized to N × 3 × H × W, where H and W are multiples of the network stride. For example, we use H = W ∈ {320, 352, 384, 416, 448, 480, 512, 544, 576, 608} for YOLOv3 training, given that the stride of the feature map is 32.
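A small sketch of sampling a random training shape per mini-batch under these constraints (function and parameter names are ours):

```python
import random

def sample_train_shape(stride=32, min_size=320, max_size=608):
    # Candidate square resolutions are multiples of the network stride,
    # matching the {320, 352, ..., 608} set used for YOLOv3 (stride 32).
    sizes = list(range(min_size, max_size + 1, stride))
    side = random.choice(sizes)
    return side, side  # (H, W) applied to the entire next mini-batch
```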
2. Experiments
In order to compare the proposed tweaks for object detection, we pick one popular object detection framework from each of the single- and multiple-stage pipelines. YOLOv3 [16] is famous for its efficiency and good accuracy. Faster-RCNN [17] is one of the most widely adopted detection frameworks and the foundation of many variants. Therefore, in this paper, we use YOLOv3 and Faster-RCNN as representatives to conduct experiments. Note that in order to remove the side effects of test-time tricks, we always report single-scale, single-model results with a standard Non-maximum Suppression implementation. We do not use external training images or labels in our experiments.
2.1 Incremental trick evaluation on Pascal VOC
Pascal VOC is the most common dataset for benchmarking object detection models [3, 12, 15]. We use Pascal VOC 2007 trainval and 2012 trainval for training and the 2007 test set for validation. The results are reported in mean average precision as defined in the Pascal VOC development kit [2]. For YOLOv3 models, we consistently validate mean average precision (mAP) at 416 × 416 resolution. If random shapes training is enabled, YOLOv3 models are fed with random resolutions from 320 × 320 to 608 × 608 in 32 × 32 increments; otherwise they are always trained with fixed 416 × 416 input data. Faster-RCNN models take arbitrary input resolutions. In order to regulate training memory consumption, the shorter sides of input images are resized to 600 pixels while ensuring the longer side is smaller than 1000 pixels. Training and validation of Faster-RCNN models follow the same preprocessing steps, except that training images have a probability of 0.5 of being flipped horizontally as additional data augmentation. The incremental evaluations of YOLOv3 and Faster-RCNN with our bag of freebies (BoF) are detailed in Table. 3 and Table. 4, respectively.
For YOLOv3, we first notice that data augmentation contributed nearly 16% to the baseline mAP, suggesting that single-stage object detection networks rely heavily on the assistance of data augmentation to create unseen patches. In terms of the training tricks mentioned in the previous section, stacking Synchronized BatchNorm, random shapes training, the cosine learning rate schedule, sigmoid label smoothing, and detection mixup continuously improves validation performance, by up to 3.43%, achieving 83.68% single-model single-scale mAP.
For Faster-RCNN, one obvious difference compared with the YOLOv3 results is that disabling data augmentation introduced only a minimal 0.16% mAP loss. This phenomenon indicates that sampling-based proposals can effectively replace the random cropping that is heavily used in single-stage object detection training pipelines. Second, the incremental mAPs give strong confidence that the proposed tricks can effectively improve model performance, with a significant 3.55% gain.
It is challenging to achieve mAP higher than 80% without external training data on Pascal VOC [17, 12, 20]. However, we managed to achieve up to 3.5% mAP gains on both YOLOv3 and Faster-RCNN models, reaching as high as 83.68% in single-model single-scale evaluation.
2.2 Bag of Freebies on MS COCO.
To further evaluate the effectiveness of the bag of freebies on a larger dataset, we benchmark on MS COCO [11] to validate the generalization of our bag of tricks. COCO 2017 is 10 times larger than Pascal VOC and contains many more tiny objects. We use similar training and validation settings as for Pascal VOC, except that input images for Faster-RCNN models are resized to 800 × 1300 pixels in response to the smaller objects. The results are shown in Table. 5.
In summary, our proposed bag of freebies boosts Faster-RCNN models by 1.1% and 1.7% absolute mean AP over existing state-of-the-art implementations [5] with ResNet-50 and ResNet-101 base models, respectively. Following the evaluation resolutions reported in [16], we list YOLOv3 evaluation results at 320, 416, and 608 resolutions to compare performance at different scales. While at 608 × 608 our model outperforms the baseline [16] by 4.0% absolute mAP, at lower resolutions the gap is a more significant 5.4% absolute mAP, almost 20% better than the baseline. Note that all these results are obtained by generating better weights in a fully compatible inference model, i.e., all these achievements are a free lunch during inference. We also notice that by adopting the bag of freebies during training, we successfully lift YOLOv3 performance to the same level as state-of-the-art Faster-RCNN [5] (37.0 vs. 36.5) while preserving the faster inference speed that is part of the single-stage model's benefits.
Mean AP is the average over 80 categories, which may not reflect per-category performance. We plot per-category AP changes of YOLOv3 and Faster-RCNN models before and after our BoF in Fig. 7 and Fig. 8, respectively. Except for rare cases, the majority of categories benefit from the bag of freebies training tricks.
2.3 Impact of mixup on different phases of training detection network
Mixup can be applied in two phases of object detection networks: 1) pre-training the classification network backbone with traditional mixup [8, 24]; 2) training detection networks using the proposed visually coherent image mixup for object detection. Since we do not freeze weights pre-trained on ImageNet, both training phases can affect the final detection models. We compare the results using a Darknet-53-based YOLOv3 [16] implementation and a ResNet-101-based [7] Faster-RCNN [17]. Final validation results are listed in Table. 6 and Table. 7, respectively. While the results prove the consistent improvements from adopting mixup in either training phase, interestingly, applying mixup in both phases produces more significant gains. For example, employing either pre-training mixup or detection mixup alone yields nearly 0.2% absolute mAP improvement over the baseline, while combining both mixup techniques achieves a 1.2% performance boost. We expect that by applying mixup in both training phases, the shallow layers of the networks receive statistically similar inputs, resulting in fewer perturbations for low-level filters.
3. Conclusion
In this paper, we propose a bag of training enhancements that significantly improve model performance while introducing zero overhead to the inference environment.
Our empirical experiments with YOLOv3 [16] and Faster-RCNN [17] on the Pascal VOC and COCO datasets show that the bag of tricks consistently improves object detection models.
By stacking all these tweaks, we observe no signs of degradation at any level and suggest wider adoption in future object detection training pipelines.
These freebies are all training-time modifications and therefore only affect model weights, without increasing inference time or changing network structures.
All existing and future work will be included as part of open source GluonCV repository [1].