Simple In-place Data Augmentation for Surveillance Object Detection
Munkh-Erdene Otgonbold1,3 Ganzorig Batnasan1 Munkhjargal Gochoo1,2
Paper: https://arxiv.org/pdf/2404.11226v1
Code:
To improve model performance in traffic monitoring tasks with limited labeled samples.
Propose a straightforward augmentation technique tailored for object detection datasets, specifically designed for stationary camera-based applications.
Key idea: place copied objects at the same positions they occupied in the original frames, which keeps the augmented scenes realistic and the added objects effective.
Vision algorithms
Detecting and tracking vehicles and pedestrians [8].
Re-identification [16].
Object detection
Face recognition [20].
Robotic grasping [18].
Human interaction [11].
Augmentation - positioning objects at different locations within pre-existing scenes [5, 13, 9].
Zoph et al. [29] study how learned, specialized data augmentation policies improve the generalization of detection models. They assembled curated subsets of the COCO dataset [19] ranging from 5,000 to 23,000 images and observed detection-accuracy gains of more than +2.3 mAP across different ResNet backbones, yielding mAP values from 39.0 to 42.1. Applied to a separate detector with an AmoebaNet-D backbone [21], the method added +1.5 mAP, reaching a state-of-the-art 50.7 mAP.
Copy-Paste augmentation [10] was introduced for instance segmentation, revealing that randomly pasting objects yields significant performance improvements over previous methods that focus on contextual modeling.
Dvornik et al. [4] proposed a data augmentation method with two main steps: first, using bounding box annotations to model visual context and train a CNN to predict object presence or absence; second, using the trained context model to generate new object locations. Applied to a subset of the Pascal VOC'12 dataset [7], training a single multi-category object detector on the substantially enlarged labeled set yielded a 1.3% average improvement over the baseline across categories.
Cubuk et al. [3] propose a simplified approach to automated augmentation strategies that eliminates the separate search phase. They apply their method to CIFAR-10/100, SVHN, ImageNet, and COCO [19]. Notably, with EfficientNet-B7 they achieve a 1.0% accuracy gain over baseline augmentation and a 0.6% improvement over AutoAugment on ImageNet.
Behpour et al. [2] offer a game-theoretic perspective on data augmentation for object detection, seeking optimal adversarial perturbations of the ground truth to improve test-time performance. They report improvements of approximately 16%, 5%, and 2% on the ImageNet [22], Pascal VOC [6], and MS-COCO [19] detection tasks, respectively, over leading data augmentation methods.
Kisantal et al. [17] tackle the performance gap between small and large objects using Mask R-CNN on MS COCO [19]. They oversample images containing small objects and augment them by repeatedly copy-pasting small objects, yielding a 9.7% relative improvement in instance segmentation and a 7.1% improvement in small-object detection over the then state of the art on MS COCO [19].
Shao et al. [23] address the limitations of SRe2L's "local-match-global" matching by introducing "generalized matching" through G-VBSM. Their approach surpasses state-of-the-art methods by 3.9%, 6.5%, and 10.1% on CIFAR-100 [1], Tiny-ImageNet [26], and ImageNet-1k [22], respectively, demonstrating strong performance on both small- and large-scale datasets.
These methods achieve favorable results by combining traditional augmentation with other techniques. However, they impose significant computational demands because of the increased data volume, and they often struggle to preserve the realism of the images.
To strengthen each object's influence on the model, we increase the number of objects within an image rather than the number of images.
From images captured by the same camera, we select objects that can be pasted back without overlapping the original objects or previously pasted ones, and place them in the scene (a minimal sketch of this procedure follows below).
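As a minimal sketch, assuming NumPy image arrays and (x1, y1, x2, y2) pixel boxes, the in-place pasting could look as follows; inplace_augment, boxes_overlap, and the candidates pool are illustrative names, not the authors' released implementation.

import numpy as np

def boxes_overlap(a, b):
    # Axis-aligned intersection test for (x1, y1, x2, y2) boxes.
    return not (a[2] <= b[0] or b[2] <= a[0] or
                a[3] <= b[1] or b[3] <= a[1])

def inplace_augment(frame, frame_boxes, candidates):
    # frame: HxWx3 uint8 array from a stationary camera.
    # frame_boxes: boxes already present in the frame.
    # candidates: (patch, box) pairs cropped from other frames of the
    # SAME camera, so each box is already a realistic position.
    out = frame.copy()
    placed = list(frame_boxes)
    added = []
    for patch, (x1, y1, x2, y2) in candidates:
        box = (x1, y1, x2, y2)
        # Keep the scene plausible: skip any candidate that would
        # occlude an original object or a previously pasted one.
        if any(boxes_overlap(box, b) for b in placed):
            continue
        out[y1:y2, x1:x2] = patch
        placed.append(box)
        added.append(box)
    return out, list(frame_boxes) + added

if __name__ == "__main__":
    # Toy usage: one existing object, one non-overlapping candidate.
    frame = np.zeros((120, 160, 3), dtype=np.uint8)
    patch = np.full((20, 20, 3), 255, dtype=np.uint8)
    aug, boxes = inplace_augment(frame, [(0, 0, 20, 20)],
                                 [(patch, (40, 40, 60, 60))])
    print(len(boxes))  # 2: the original box plus one pasted object

Because the candidate boxes come from the same stationary camera view, pasting at the original coordinates preserves scene context; the overlap check is the only placement constraint needed.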
FishEye8K.
UA-DETRAC.
Object detection concentrates on identifying instances of objects from a specified category. When training objects are placed at unrealistic positions, the model struggles to implicitly capture context, and detection accuracy drops substantially.
[5] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017.
[9] Georgios Georgakis, Arsalan Mousavian, Alexander Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. In Robotics: Science and Systems XIII. Robotics: Science and Systems Foundation, 2017.
[11] Georgia Gkioxari, Ross Girshick, Piotr Dollar, and Kaiming He. Detecting and recognizing human-object interactions. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018.
[13] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
[18] Sulabh Kumra and Christopher Kanan. Robotic grasp detection using deep convolutional neural networks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
[20] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):121–135, 2019.