TransTrack
{Tracking-by-detection}
0) Motivation, Objectives and Related Work:
Motivation:
Objectives:
Sun et al. [15] proposed TransTrack, a Transformer-based [24] model for multi-object tracking. Its architecture, which extends the Detection Transformer (DETR) [25], is shown in Fig..
First, each video frame is fed into a CNN backbone to compute a feature map. Second, the feature maps of the current frame t and the previous frame t-1 are linearly projected and reshaped into a sequence of tokens.
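The projection and reshaping step can be illustrated with a minimal sketch in plain Python. It assumes a feature map stored as nested lists of shape C x H x W and a hypothetical projection matrix of shape d x C (equivalent to a 1x1 convolution). The function name `tokens_from_feature_map` and the toy weights are illustrative only and are not taken from the TransTrack code. How the tokens of frames t and t-1 are combined is not covered here.

```python
def tokens_from_feature_map(feature_map, projection):
    """Project a C x H x W feature map to d channels and flatten it to H*W tokens.

    feature_map: nested lists indexed as [channel][row][col].
    projection:  nested lists of shape d x C (hypothetical weights).
    Returns a list of H*W tokens, each a list of d floats, in row-major order.
    """
    num_channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    tokens = []
    for row in range(height):
        for col in range(width):
            # The C-dimensional feature vector at this spatial location.
            pixel = [feature_map[c][row][col] for c in range(num_channels)]
            # Linear projection: one dot product per output channel.
            token = [sum(w * x for w, x in zip(weights, pixel))
                     for weights in projection]
            tokens.append(token)
    return tokens


# Toy example (assumed values): C=2 channels, H=2, W=3, projected to d=1.
feature_map = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],        # channel 0
    [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]],  # channel 1
]
projection = [[1.0, 0.5]]  # shape d x C = 1 x 2
tokens = tokens_from_feature_map(feature_map, projection)
# tokens now holds H*W = 6 tokens, one per spatial location.
```

Each spatial location of the feature map becomes one token, so the sequence length grows with the resolution of the feature map rather than with the number of objects in the frame.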
Next, a Transformer encoder [24] processes this sequence and outputs another sequence with an enhanced feature representation. In DETR, a Transformer decoder then processes this output sequence together with a set of learnable object queries to produce object features.
TransTrack, however, adds a second Transformer decoder that takes the object features from the previous frame as an additional input and predicts the features of existing tracks. A matching head then matches the object features against the track features to produce the tracking results. As in other tracking methods, an unmatched track is marked inactive, and a track is terminated once it has been inactive for a set number of consecutive frames.
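The matching and track-lifecycle logic can be sketched in plain Python under several simplifying assumptions. The matching head is replaced here by greedy IoU matching on bounding boxes, which is a stand-in and not necessarily the procedure used in [15]. Starting a new track from every unmatched detection is also an assumption. The names `iou`, `update_tracks`, `iou_threshold`, and `max_inactive`, and their default values, are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def update_tracks(tracks, detections, next_id, iou_threshold=0.5, max_inactive=3):
    """Run one tracking step; mutates `tracks` and returns the next unused id.

    tracks:     dict mapping track id -> {"box": box, "inactive": int}
    detections: list of boxes detected in the current frame
    """
    # Greedy matching (assumed stand-in): visit track/detection pairs
    # from highest to lowest IoU.
    pairs = sorted(
        ((iou(track["box"], det), tid, di)
         for tid, track in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matched_tracks, matched_dets = set(), set()
    for score, tid, di in pairs:
        if score < iou_threshold:
            break  # all remaining pairs overlap too little
        if tid in matched_tracks or di in matched_dets:
            continue
        tracks[tid]["box"] = detections[di]
        tracks[tid]["inactive"] = 0
        matched_tracks.add(tid)
        matched_dets.add(di)

    # Unmatched tracks become inactive; remove them after too many misses.
    for tid in list(tracks):
        if tid not in matched_tracks:
            tracks[tid]["inactive"] += 1
            if tracks[tid]["inactive"] > max_inactive:
                del tracks[tid]

    # Assumption: every unmatched detection starts a new track.
    for di, det in enumerate(detections):
        if di not in matched_dets:
            tracks[next_id] = {"box": det, "inactive": 0}
            next_id += 1
    return next_id


tracks, next_id = {}, 0
next_id = update_tracks(tracks, [(0, 0, 10, 10)], next_id)  # starts track 0
next_id = update_tracks(tracks, [(1, 0, 11, 10)], next_id)  # matched to track 0
for _ in range(4):                                          # four empty frames
    next_id = update_tracks(tracks, [], next_id)
# Track 0 has exceeded max_inactive=3 and has been removed.
```

Keeping an unmatched track alive for a few frames lets it be re-associated after a short occlusion, at the cost of carrying stale tracks for up to `max_inactive` frames.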