Few-shot Data Sampling
A Aboah, B Wang, U Bagci, Y Adu-Gyamfi
{YOLOv8, Few-shot data sampling, Test-time augmentation (TTA)}
Code: https://github.com/aboah1994/few-shotVideo-Data-Sampling.git.
Helmet usage violations continue to be a significant problem.
Automatic helmet detection systems have been proposed and implemented using computer vision techniques.
Real-time implementation of such systems is crucial.
Proposes a robust real-time helmet violation detection system.
A unique data processing strategy, referred to as few-shot data sampling, to develop a robust model with fewer annotations.
A single-stage object detection model, YOLOv8 (You Only Look Once Version 8), for detecting helmet violations in real-time from video frames.
[15, 2, 13, 14] used color and texture-based features to detect helmets in real-time and reported an accuracy rate of 89.5%.
[16] used a Convolutional Neural Network (CNN) trained on a large dataset of helmet and non-helmet images and reported a high accuracy rate of 97.5%.
[3, 4, 9, 13].
[6] used YOLOv3 for real-time helmet enforcement and reported an accuracy rate of 96.2% with a processing time of less than 30 milliseconds per frame.
[5] proposed a real-time helmet enforcement system using a combination of color and texture-based features and a deep neural network to detect helmets in real-time and reported an accuracy rate of 95.6% and a processing time of less than 100 milliseconds per frame.
Develop a real-time helmet violation detection system that is robust to varying weather conditions and times of day.
"Few-shot data sampling technique": Developing a robust helmet detection model with fewer annotations.
Selecting a small but representative number of images from a large dataset using our developed algorithms.
Applying data augmentation techniques to generate additional images for training.
By using this technique, we are able to develop a robust helmet detection model with fewer annotations.
Two main data pre-processing steps:
A few-shot data sampling framework: select the best representative set of data for training.
Data augmentation: increase the variety of the training data.
Purpose:
Missing annotations as illustrated in Fig. 4 ==> a few-shot data sampling framework was developed.
This framework was designed to help select the most representative frames of the entire dataset and minimize the need for re-annotation of all 20,000 frames.
Three primary steps:
Determining the background in each video: (1) randomly select frames within a 10-second period; (2) compute the median of 60 percent of all frames in the sample (see the sketch after this list). ==> Helps negate the impact of short-term video resolution changes such as zooms and pixelation.
Using Algorithm 1 to categorize the videos according to the time of day and weather conditions, such as day, night, and fog. ==> Ensures a balanced representation of all video types in the training data.
A frame-sampling algorithm (Algorithm 2) then selects more frames from the video types identified as underrepresented.
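The background-estimation step can be sketched as follows. This is a minimal Python sketch assuming OpenCV video I/O; the exact frame-selection details (random positions within the window, how the 60-percent subset is drawn) are assumptions where the text is not fully specific.

```python
import random

import cv2
import numpy as np

def estimate_background(video_path: str, window_s: int = 10, keep_frac: float = 0.6) -> np.ndarray:
    """Estimate a static background as the per-pixel median of a random frame sample."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # Randomly pick frame indices within a 10-second window.
    window = min(int(window_s * fps), n_frames)
    sample_size = max(1, int(keep_frac * window))  # keep 60 percent of the window's frames
    indices = sorted(random.sample(range(window), sample_size))

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()

    # The per-pixel median suppresses short-term artifacts such as zooms and pixelation.
    return np.median(np.stack(frames), axis=0).astype(np.uint8)
```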
Used to categorize videos according to the time of day and weather conditions.
The proposed algorithm takes the estimated video background and calculates the frequency of each pixel. If the maximum frequency corresponds to a pixel value less than 150, the algorithm classifies the image as night; otherwise, the algorithm classifies the image as day or foggy.
To distinguish between daytime and foggy videos, the skewness of the image frequencies is computed. The algorithm classifies the video as foggy if the absolute skewness is close to zero.
The frequency distribution of the day, night, and fog images are shown in Fig. 5.
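A minimal sketch of the Algorithm 1 logic as described above: the 150 threshold comes from the text, while the near-zero skewness tolerance (0.5 below) and the use of pixel intensities for the skewness computation are assumptions.

```python
import numpy as np
from scipy.stats import skew

def classify_video(background: np.ndarray, night_thresh: int = 150,
                   fog_skew_tol: float = 0.5) -> str:
    """Classify an estimated video background as day, night, or fog."""
    gray = background.mean(axis=2) if background.ndim == 3 else background
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))

    # Night: the most frequent pixel value is darker than the threshold.
    if np.argmax(hist) < night_thresh:
        return "night"
    # Fog: absolute skewness of the intensity distribution is close to zero; otherwise day.
    return "fog" if abs(skew(gray.ravel())) < fog_skew_tol else "day"
```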
The algorithm aims to select a balanced set of frames from each video category by considering the total number of videos and their frame rates.
Iterate through each video category: A loop iterates through each category (1, 2, 3, ..., n) of videos.
Calculate sample rate: For each category, the algorithm calculates a sample rate by dividing the total number of desired frames (n_frames) by the sum, across all categories, of the product of the number of videos and their frames per second (videos × fps).
Select frames: Within each category, the algorithm selects frames from each video at the calculated sample rate. The exact within-video selection strategy is not specified in the figure; see the sketch after this list.
End loop: After iterating through all video categories, the algorithm returns the selected frames (selected_frames).
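A sketch of Algorithm 2 under the description above. The input layout (per-category lists of videos carrying fps and frame counts) and the uniform within-video striding are assumptions, since the figure leaves them unspecified.

```python
def sample_frames(categories: dict[str, list[dict]], n_frames: int) -> dict[str, list[tuple]]:
    """categories maps category name -> list of videos, each {'fps': ..., 'n': frame count}."""
    # Denominator: sum across all categories of (number of videos x fps).
    total = sum(v["fps"] for vids in categories.values() for v in vids)
    sample_rate = n_frames / total  # desired frames per unit of (video x fps)

    selected_frames = {}
    for cat, vids in categories.items():
        picks = []
        for i, v in enumerate(vids):
            k = max(1, round(sample_rate * v["fps"]))  # frames to draw from this video
            step = max(1, v["n"] // k)                 # uniform stride (assumed strategy)
            picks.extend((i, f) for f in range(0, v["n"], step)[:k])
        selected_frames[cat] = picks
    return selected_frames
```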
Image flipping: Images were flipped horizontally to help the model learn to detect helmets on both sides of the motorcycle.
Rotation: Rotation was applied to augment the data by changing the viewpoint angle of the helmet.
Scaling: Scaling was used to change the size of the helmet in the image, which can help the model learn to detect helmets of different sizes.
Cropping: Cropping of images was done to simulate the effect of occlusion, so that the model can learn to detect helmets even when they are partially obscured.
Blurring: Blurring of images was carried out to help the model learn to detect helmets under poor lighting conditions.
Color manipulation: We adjusted the brightness, contrast, and saturation of the image to help the model learn to detect helmets in different lighting conditions.
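These augmentations could be composed, for instance, with torchvision transforms; the library choice and the parameter values below are illustrative assumptions, not settings reported in the paper.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # image flipping
    transforms.RandomRotation(degrees=15),                # rotation (viewpoint change)
    transforms.RandomResizedCrop(832, scale=(0.6, 1.0)),  # scaling + cropping (occlusion)
    transforms.GaussianBlur(kernel_size=5),               # blurring
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # color manipulation
])
```

Note that for detection training, box-aware variants of these transforms (e.g., as provided by Albumentations) would be needed so that bounding boxes are transformed along with the images.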
Detection models: YOLOv5, YOLOv7, and YOLOv8.
Test Time Augmentation (TTA) involves applying data augmentation techniques, such as rotation, flipping, or cropping, to the test data and then making predictions on each augmented version of the test data.
The final prediction is then made by averaging the predictions made on the augmented versions of the test data.
TTA can be computationally expensive. However, it can be implemented efficiently by using parallel processing or by batching the augmented data.
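As a concrete illustration, a horizontal-flip TTA pass for a detector might look as follows. The model interface (boxes in xyxy format plus scores) is an assumption, and NMS-based fusion stands in for the prediction averaging described above, which is the common way to merge box outputs.

```python
import torch
from torchvision.ops import nms

def tta_predict(model, image: torch.Tensor, iou_thresh: float = 0.5):
    """image: (C, H, W) tensor; model is assumed to return (boxes[N, 4] in xyxy, scores[N])."""
    w = image.shape[-1]

    boxes1, scores1 = model(image)
    boxes2, scores2 = model(torch.flip(image, dims=[-1]))  # horizontally flipped copy

    # Un-flip the boxes from the augmented pass: x' = W - x, swapping x1 and x2.
    boxes2 = boxes2.clone()
    boxes2[:, [0, 2]] = w - boxes2[:, [2, 0]]

    # Fuse duplicate detections from the two passes with NMS.
    boxes = torch.cat([boxes1, boxes2])
    scores = torch.cat([scores1, scores2])
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```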
All models were trained on an NVIDIA GeForce RTX 3090 GPU using 4,500 training examples.
The dataset was split in a 0.7:0.3 ratio for training and validation, respectively.
The test dataset was provided separately by the organizers of the competition.
To prevent the model from overfitting frames with high similarity, we employed the Semantic Clustering by Adopting Nearest Neighbors (SCAN) algorithm [10] to eliminate frames with high similarity (Fig 6).
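The following is not the full SCAN pipeline, only a simplified stand-in for its effect here: a frame is dropped when its embedding is too similar to one already kept. The embedding backbone and the similarity threshold are assumptions.

```python
import torch
from torchvision.models import resnet18

def deduplicate(frames: torch.Tensor, sim_thresh: float = 0.95) -> list[int]:
    """frames: (N, 3, H, W) normalized tensor; returns indices of frames to keep."""
    backbone = resnet18(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()  # use penultimate features as embeddings
    backbone.eval()

    with torch.no_grad():
        emb = torch.nn.functional.normalize(backbone(frames), dim=1)

    keep: list[int] = []
    for i in range(len(emb)):
        # Keep frame i only if its cosine similarity to every kept frame is below threshold.
        if not keep or (emb[i] @ emb[keep].T).max() < sim_thresh:
            keep.append(i)
    return keep
```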
All models were trained for 400 epochs with a batch size of 16 and an image size of 832×832.
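With the Ultralytics YOLOv8 API, this training configuration corresponds roughly to the sketch below; the checkpoint variant and the dataset YAML path are placeholders, not values from the paper.

```python
from ultralytics import YOLO

# Start from a pretrained checkpoint; the variant choice is an assumption.
model = YOLO("yolov8x.pt")
model.train(data="helmet.yaml", epochs=400, batch=16, imgsz=832)
```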
2023 NVIDIA AI CITY CHALLENGE, Track 5 (motorcyclists): 100 videos for training and 100 for testing, each 20 s long, at 10 fps and 1920×1080 pixels.
Evaluation metrics: mAP, Precision, Recall.
Addressing day/night and adverse-weather variation in the traffic dataset.
Addressing missing annotations, as illustrated in Fig. 4.