PermaTrack -
{Tracking-by-detection}
0) Motivation, Objectives and Related Works:
Motivation:
Objectives:
When a target object, for example a person, disappears from view in a frame, it may still be within the frame but fully occluded by other objects, in which case its location can still be approximated. PermaTrack keeps tracking the locations of such invisible objects; once they become visible again, they can be re-associated with their tracks using the latest position estimated during the invisible period.
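The re-association idea above can be illustrated with a minimal sketch. It assumes each occluded track stores the last center estimated while it was invisible, and it uses a greedy nearest-center match as a simplified stand-in for the tracker's actual association step. The function name `reassociate` and the `max_dist` threshold are hypothetical.

```python
import math


def reassociate(hidden_tracks, detections, max_dist=50.0):
    """Greedily match reappearing detections to occluded tracks.

    hidden_tracks: dict mapping track_id -> (x, y), the last center
        estimated for that track while it was invisible.
    detections: list of (x, y) centers of newly visible objects.
    max_dist: hypothetical distance threshold (pixels) for a valid match.

    Returns a dict mapping detection index -> track_id. Detections with
    no occluded track within max_dist are left out (they would start
    new tracks).
    """
    # Enumerate all (distance, detection, track) candidates, closest first.
    candidates = sorted(
        (math.dist(position, detection), det_idx, track_id)
        for track_id, position in hidden_tracks.items()
        for det_idx, detection in enumerate(detections)
    )

    matches = {}
    used_tracks = set()
    for distance, det_idx, track_id in candidates:
        if distance > max_dist:
            break  # all remaining candidates are even farther away
        if det_idx in matches or track_id in used_tracks:
            continue  # each detection and each track is matched at most once
        matches[det_idx] = track_id
        used_tracks.add(track_id)
    return matches


hidden = {7: (100.0, 100.0), 9: (300.0, 120.0)}
visible = [(305.0, 118.0), (500.0, 500.0)]
matches = reassociate(hidden, visible)
```

Here the first detection is matched to track 9, whose last estimated position is about 5 pixels away, while the second detection is too far from any occluded track and stays unmatched.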
PermaTrack is built on the CenterTrack architecture. A convolutional gated recurrent unit (ConvGRU) is inserted into the model, as shown in Fig. 11, to convert the frame-by-frame feature map Fᵗ into a state matrix Mᵗ. This state matrix encodes the entire history of previously seen objects. The predictions, i.e., the heatmap, size map, displacement map, and the additional visibility map introduced in PermaTrack, are made from this state matrix instead of the frame-by-frame feature map. The visibility map, generated by an additional visibility head, is binary, i.e., it takes a value of either 0 or 1. Although PermaTrack can keep track of all objects regardless of their visibility, only visible objects are reported in the output; this avoids being penalized during performance evaluation, since most real-world datasets do not annotate invisible objects.
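For reference, the recurrent update from Fᵗ and Mᵗ⁻¹ to Mᵗ can be written in the standard ConvGRU form below. This is a sketch of the generic formulation, not necessarily the exact variant used in PermaTrack; the weight and bias names (W, U, b) are notation assumed here, * denotes convolution, ⊙ element-wise multiplication, and σ the sigmoid function.

```latex
\begin{aligned}
z^t &= \sigma\left(W_z * F^t + U_z * M^{t-1} + b_z\right) \\
r^t &= \sigma\left(W_r * F^t + U_r * M^{t-1} + b_r\right) \\
\tilde{M}^t &= \tanh\left(W_m * F^t + U_m * \left(r^t \odot M^{t-1}\right) + b_m\right) \\
M^t &= \left(1 - z^t\right) \odot M^{t-1} + z^t \odot \tilde{M}^t
\end{aligned}
```

The update gate zᵗ controls how much of the previous state is overwritten at each location, which is what allows information about an occluded object to persist in Mᵗ while the current frame's features Fᵗ carry no evidence of it.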
The lack of annotations for invisible objects also makes such a model difficult to train, since there is no supervision signal for tuning the visibility head. Tokmakov et al. [14] solved this problem by using a tool called ParallelDomain (PD) to generate a synthetic dataset in which accurate ground truth for invisible objects is easily obtained. However, training a model purely on synthetic data could lead to poor performance on real-world data. They therefore trained their model jointly on synthetic and real-world data. For the real-world data, because annotations of invisible objects are missing, they trained the model on sequences of length two only; for the synthetic data, where all ground truth is available, longer sequences can be used.
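The joint training scheme can be sketched as a clip sampler that draws length-two clips from real videos and longer clips from synthetic ones. This is a minimal illustration only: the synthetic clip length, the mixing ratio, and all function names are assumptions rather than values from the paper, and how the two sources are actually combined within a batch is not specified here.

```python
import random

REAL_SEQ_LEN = 2   # from the text: real-world data uses length-two sequences
SYNTH_SEQ_LEN = 8  # assumption: placeholder for "longer sequences"


def sample_clip(video_len, seq_len, rng):
    """Return the indices of seq_len consecutive frames from a video."""
    if video_len < seq_len:
        raise ValueError("video is shorter than the requested clip")
    start = rng.randrange(video_len - seq_len + 1)
    return list(range(start, start + seq_len))


def sample_joint_batch(real_videos, synth_videos, batch_size, synth_ratio, rng):
    """Draw a mixed list of (domain, video_name, frame_indices) clips.

    real_videos / synth_videos: dicts mapping video name -> frame count.
    synth_ratio: hypothetical probability of drawing a synthetic clip.
    """
    batch = []
    for _ in range(batch_size):
        if rng.random() < synth_ratio:
            domain, videos, seq_len = "synthetic", synth_videos, SYNTH_SEQ_LEN
        else:
            domain, videos, seq_len = "real", real_videos, REAL_SEQ_LEN
        name = rng.choice(sorted(videos))
        batch.append((domain, name, sample_clip(videos[name], seq_len, rng)))
    return batch


rng = random.Random(0)
batch = sample_joint_batch({"real_a": 30}, {"synth_a": 40}, 16, 0.5, rng)
```

Only the synthetic clips are long enough to contain a full occlusion with ground truth before, during, and after it, so they are the ones that can supervise the visibility head on invisible objects.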