CenterTrack -
{Tracking-by-detection}
0) Motivation, Objectives and Related Works:
Motivation:
Objectives:
CenterTrack is based on an object detector called CenterNet.
Unlike detectors that predict the top-left corner of each bounding box together with its size, CenterNet predicts the center of each bounding box. CenterNet takes a video frame as input and generates a low-resolution heatmap, which gives the likelihood of finding an object's center at each location, and a size map, which gives the width and height of the object at each location. Each local maximum of the heatmap, called a peak, is taken as the center of a detected object.
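The decoding step described above can be sketched as follows. This is a minimal illustration, not CenterNet's actual code: the function names `find_peaks` and `decode_boxes`, the 3x3 neighbourhood used to decide what counts as a local maximum, and the score threshold are all assumptions made for the example. The heatmap and size map are plain nested lists so that the sketch needs no external library.

```python
def find_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for each cell that is a local maximum.

    Assumption: a cell is a peak if it is >= all cells in its 3x3
    neighbourhood and its score is at least `threshold`.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    return peaks


def decode_boxes(heatmap, size_map, threshold=0.3):
    """Turn peaks into (x1, y1, x2, y2, score) boxes in heatmap coordinates.

    size_map[r][c] holds the (width, height) predicted at that location.
    """
    boxes = []
    for r, c, score in find_peaks(heatmap, threshold):
        w, h = size_map[r][c]
        boxes.append((c - w / 2, r - h / 2, c + w / 2, r + h / 2, score))
    return boxes


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.1],
    [0.0, 0.0, 0.1, 0.5],
]
size_map = [[(2.0, 4.0)] * 4 for _ in range(4)]
boxes = decode_boxes(heatmap, size_map)
```

Because the heatmap is low-resolution, the box coordinates produced here would still have to be scaled back to the input image size. Note also that a flat plateau of equal scores would yield several peaks under the `>=` comparison used in this sketch.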
As shown in Fig. , Zhou et al. [13] modified the architecture of CenterNet by adding the previous frame (three channels) and its heatmap (one channel) as inputs in addition to the current frame. These additional inputs tell CenterTrack where the objects detected in the previous frame are located. They also added an output branch that predicts each object's displacement between the two frames (two channels). The displacement is used to perform association. In particular, each detection in the current frame at position p is associated with the closest unmatched detection in the previous frame, measured from the displacement-compensated location p - d_p, where d_p is the predicted displacement of the object detected at p.
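The association rule can be sketched as a greedy nearest-neighbour match. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `associate`, the dictionary layout of a detection, the processing order (highest score first), the distance cutoff `max_dist`, and the rule that an unmatched detection starts a new track are all choices made for the example rather than details given in the text above.

```python
import math


def associate(current, previous, next_id, max_dist=float("inf")):
    """Greedily match current detections to previous ones.

    current:  list of dicts with 'center' (x, y), 'disp' (dx, dy), 'score'
    previous: list of dicts with 'center' (x, y), 'track_id'
    Returns (track_ids, next_id), where track_ids[i] belongs to current[i].
    """
    unmatched = set(range(len(previous)))
    track_ids = [None] * len(current)
    # Assumption: more confident detections get to choose first.
    order = sorted(range(len(current)), key=lambda i: -current[i]["score"])
    for i in order:
        (x, y), (dx, dy) = current[i]["center"], current[i]["disp"]
        px, py = x - dx, y - dy  # displacement-compensated location p - d_p
        best, best_dist = None, max_dist
        for j in sorted(unmatched):
            qx, qy = previous[j]["center"]
            dist = math.hypot(px - qx, py - qy)
            if dist < best_dist:
                best, best_dist = j, dist
        if best is None:
            # No unmatched previous detection within range: start a new track.
            track_ids[i] = next_id
            next_id += 1
        else:
            track_ids[i] = previous[best]["track_id"]
            unmatched.discard(best)
    return track_ids, next_id


previous = [
    {"center": (10, 10), "track_id": 1},
    {"center": (50, 50), "track_id": 2},
]
current = [
    {"center": (12, 11), "disp": (2, 1), "score": 0.9},
    {"center": (53, 50), "disp": (3, 0), "score": 0.8},
    {"center": (100, 100), "disp": (0, 0), "score": 0.7},
]
track_ids, next_id = associate(current, previous, next_id=3, max_dist=5.0)
```

In this example the first two detections inherit tracks 1 and 2 because their compensated locations fall exactly on the previous centers, while the third has no unmatched candidate left and receives a new identity.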
CenterTrack does not require any video annotation for training; instead, it can be trained on a set of static images with detection ground truth. Given an image and its detection ground truth, the ground-truth heatmap is generated by placing a normal (Gaussian) distribution at the location of each object. The image is then randomly scaled and shifted to generate a simulated previous frame, whose detection ground truth can be computed from the scaling and shifting factors. Once the objects' locations in the current and simulated previous frames are known, the displacement can be obtained. These ground truths are then used to train CenterTrack with a pair of images as input. Because the two architectures are closely similar, the weights of a pre-trained CenterNet can be used to initialize the corresponding weights in CenterTrack, while the additional weights are randomly initialized. This training strategy allows CenterTrack to be trained jointly on the detection and tracking tasks using a large-scale static image dataset such as CrowdHuman.
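The ground-truth generation described above can be sketched as follows. The sketch only transforms object centers, not image pixels, and several details are assumptions made for illustration: the function names, the fixed `sigma`, taking the element-wise maximum where Gaussians overlap, scaling about the image center, and the ranges `max_scale` and `max_shift`. The displacement is defined as current minus previous, so that p - d_p gives the previous location.

```python
import math
import random


def gaussian_heatmap(centers, height, width, sigma=1.0):
    """Render one unnormalised Gaussian (peak value 1) per object center."""
    heat = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                # Assumption: overlapping Gaussians are merged with max.
                heat[y][x] = max(heat[y][x], g)
    return heat


def simulate_previous(centers, image_center, max_scale=0.05, max_shift=5.0,
                      rng=random):
    """Randomly scale and shift the centers to mimic a previous frame.

    Returns (previous_centers, displacements), where each displacement is
    current - previous.
    """
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    icx, icy = image_center
    prev = [(icx + s * (x - icx) + tx, icy + s * (y - icy) + ty)
            for x, y in centers]
    disp = [(x - px, y - py) for (x, y), (px, py) in zip(centers, prev)]
    return prev, disp


centers = [(10.0, 20.0), (25.0, 8.0)]
rng = random.Random(0)  # fixed seed so the example is repeatable
prev_centers, displacements = simulate_previous(centers, (16.0, 16.0), rng=rng)
prev_heatmap = gaussian_heatmap(prev_centers, height=32, width=32)
```

A full training pipeline would apply the same scale and shift to the image itself to produce the simulated previous frame, and would feed `prev_heatmap` as the extra input channel alongside it.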