JDE -
Tracking-by-detection
0) Motivation, Objectives and Related Works:
Motivation:
Objectives:
JDE combines the detector and the visual appearance model into a single shared model.
As shown in Fig. , many other trackers, particularly those following the tracking-by-detection approach, rely on two separate models: one for object detection and another for generating visual appearance features, each trained separately on its own task. JDE instead uses a single-stage object detector that is trained jointly on detection and embedding generation; consequently, one forward pass yields both the detected bounding boxes and their visual appearance features.
JDE uses an architecture with a Feature Pyramid Network (FPN) to handle objects of various scales. In addition to the box classification head and the box regression head, JDE has an embedding head that generates a dense embedding map representing the visual appearance features of detected objects. During training, a cross-entropy loss is computed at the box classification head, a smooth-L1 loss at the box regression head, and a triplet loss at the embedding head. The first two losses train JDE to locate and classify objects correctly, while the third yields features that distinguish between different objects of the same class. The three losses are then fused into a single objective that guides the optimization.
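The three-loss objective described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the fixed fusion weights, the margin, and the use of Euclidean distance in the triplet term are all assumptions made for illustration, and a real model would compute these terms over batched tensors in a deep learning framework.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample; `target` is a class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 loss averaged over the box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def triplet(anchor, positive, negative, margin=0.3):
    """Triplet loss: pull the same identity closer, push others away.

    Euclidean distance and margin=0.3 are illustrative assumptions.
    """
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def fused_loss(l_cls, l_box, l_emb, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (weights are hypothetical)."""
    w_cls, w_box, w_emb = weights
    return w_cls * l_cls + w_box * l_box + w_emb * l_emb


# Toy values for one detection: two-class logits, a 4-d box, 2-d embeddings.
l_cls = cross_entropy([2.0, 0.5], target=0)
l_box = smooth_l1([0.1, 0.2, 1.0, 1.1], [0.0, 0.0, 1.0, 1.0])
l_emb = triplet([0.0, 0.0], [0.1, 0.0], [1.0, 1.0])
total = fused_loss(l_cls, l_box, l_emb)
```

A plain weighted sum is the simplest way to fuse the terms; the text does not specify the fusion rule, so the weighting here should be read as a placeholder.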
During the association step, JDE computes a motion affinity matrix and an appearance affinity matrix between all detections and all existing tracklets. The two matrices are combined into a single cost matrix, and the Hungarian algorithm is then applied once per frame to solve the resulting assignment problem.
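The association step can be sketched as follows, using only the Python standard library. Several choices here are assumptions rather than details from the text: appearance cost is taken as cosine distance between embeddings, motion cost as one minus the IoU between the tracklet's box and the detection box, the two are mixed with a hypothetical weight `w_app`, and matches above a hypothetical `max_cost` gate are discarded. The `hungarian` function is a compact potentials-based implementation of the Hungarian algorithm for matrices with no more rows than columns.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_cost(tracks, dets, w_app=0.5):
    """Fuse appearance and motion costs; each item is (embedding, box)."""
    return [[w_app * cosine_distance(te, de) + (1.0 - w_app) * (1.0 - iou(tb, db))
             for de, db in dets] for te, tb in tracks]


def hungarian(cost):
    """Minimum-cost assignment; requires len(cost) <= len(cost[0])."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # walk back along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]


def associate(cost, max_cost=0.7):
    """Return sorted (track_idx, det_idx) matches whose cost passes the gate."""
    if not cost or not cost[0]:
        return []
    if len(cost) <= len(cost[0]):
        pairs = hungarian(cost)
    else:  # more tracks than detections: solve the transposed problem
        transposed = [list(col) for col in zip(*cost)]
        pairs = [(r, c) for c, r in hungarian(transposed)]
    return sorted((r, c) for r, c in pairs if cost[r][c] <= max_cost)


# Two tracklets and three detections for a single frame (toy values).
tracks = [([1.0, 0.0], (0, 0, 10, 10)), ([0.0, 1.0], (20, 20, 30, 30))]
dets = [([0.0, 1.0], (21, 20, 31, 30)),
        ([1.0, 0.0], (1, 0, 11, 10)),
        ([1.0, 1.0], (100, 100, 110, 110))]
matches = associate(build_cost(tracks, dets))
```

In this toy frame, tracklet 0 pairs with detection 1 and tracklet 1 with detection 0; detection 2 stays unmatched and would typically seed a new tracklet. In practice a library routine such as SciPy's `linear_sum_assignment` would usually replace the hand-written solver.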