TrackFormer -
{Tracking-by-detection}
0) Motivation, Objectives and Related Work:
Motivation:
Objectives:
Meinhardt et al. [16] proposed TrackFormer, a multi-object tracking method that is similar to TransTrack but uses only one Transformer decoder. As shown in Fig. 13, TrackFormer processes a video sequence frame by frame using a CNN backbone followed by a Transformer encoder and a Transformer decoder. In the first frame, the decoder takes a set of learnable object queries (shown as white boxes) as an additional input, as in DETR, and predicts a set of tracks. In the next frame, the features of these tracks are reused as track queries (shown as colored boxes) and combined with some of the object queries to produce a joint set of the same size. This joint set replaces the original set of object queries, so the decoder can both track existing objects (using the track queries) and detect new objects (using the remaining object queries). The process is repeated for every subsequent frame of the video sequence. Both TransTrack and TrackFormer achieved state-of-the-art performance at the time of publication, showing the feasibility of using a Transformer module to solve tracking problems.
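The frame-by-frame query handling described above can be illustrated with a small toy sketch. It is not the authors' implementation: `toy_decoder` is a hypothetical stand-in for the backbone, encoder and decoder, and `NUM_QUERIES`, `FEATURE_DIM` and `SCORE_THRESHOLD` are assumed values chosen only for illustration. The sketch shows only how track queries from one frame are merged with some of the object queries into a joint set of the same size; identity bookkeeping and training are omitted.

```python
import math
import random

NUM_QUERIES = 8        # size of the joint query set (assumed value)
FEATURE_DIM = 4        # embedding size of each query (assumed value)
SCORE_THRESHOLD = 0.5  # confidence cut-off for keeping a track (assumed value)


def make_object_queries(rng):
    """Create the fixed set of object queries (random here, learned in practice)."""
    return [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
            for _ in range(NUM_QUERIES)]


def toy_decoder(frame_features, queries):
    """Hypothetical stand-in for the decoder: one (feature, score) pair per query."""
    outputs = []
    for query in queries:
        feature = [0.5 * q + 0.5 * f for q, f in zip(query, frame_features)]
        score = 1.0 / (1.0 + math.exp(-sum(feature)))  # sigmoid confidence
        outputs.append((feature, score))
    return outputs


def build_joint_queries(track_queries, object_queries):
    """Combine track queries with some object queries, keeping the set size fixed."""
    n_free = max(0, len(object_queries) - len(track_queries))
    return track_queries + object_queries[:n_free]


def track_video(frames, object_queries):
    """Run the toy decoder frame by frame, reusing track features as queries."""
    track_queries = []  # empty in the first frame: only object queries are used
    history = []
    for frame_features in frames:
        queries = build_joint_queries(track_queries, object_queries)
        outputs = toy_decoder(frame_features, queries)
        # Features of confident outputs become the track queries of the next frame.
        track_queries = [feat for feat, score in outputs if score > SCORE_THRESHOLD]
        history.append((len(queries), len(track_queries)))
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    object_queries = make_object_queries(rng)
    frames = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(3)]
    for index, (n_in, n_tracks) in enumerate(track_video(frames, object_queries)):
        print(f"frame {index}: {n_in} queries in, {n_tracks} tracks kept")
```

In this sketch the joint set always contains `NUM_QUERIES` entries, so the more tracks are carried over, the fewer object queries remain available for detecting new objects.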