SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow
Orcun Cetintas, Tim Meinhardt, Guillem Brasó, Laura Leal-Taixé
TL;DR
The paper tackles the high cost of annotating long multi-object tracking sequences by introducing SPAM, a video label engine that combines synthetic pre-training on MOTSynth, pseudo-labeling on real data, and active learning guided by a hierarchical graph model, so that detections and associations are labeled with minimal human input. Its detection-first, graph-based labeling framework exploits temporal dependencies to propagate decisions across time, while uncertainty-driven annotator intervention directs human effort to only the hardest cases. Empirically, SPAM reaches near-ground-truth tracking performance with as little as 3.3% of the manual annotations on MOT17 (and with low budgets on MOT20 and DanceTrack), and retraining trackers on SPAM-generated pseudo-labels yields substantial gains without manual labeling. Overall, SPAM demonstrates that synthetic pre-training, self-training with pseudo-labels, and hierarchical graph reasoning can dramatically reduce labeling costs and scale up tracking datasets for data-hungry trackers; the models and code are open source.
Abstract
Increasing the efficiency of trajectory annotation from videos has the potential to enable the next generation of data-hungry tracking algorithms to thrive on large-scale datasets. Despite the importance of this task, very few works currently explore how to label tracking datasets both efficiently and comprehensively. In this work, we introduce SPAM, a video label engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights. First, most tracking scenarios can be easily resolved; to take advantage of this, we use a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances. Second, the spatiotemporal dependencies of track annotations across time can be elegantly and efficiently formulated through graphs; we therefore use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations at a fraction of the ground-truth labeling cost. We demonstrate that trackers trained on SPAM labels achieve performance comparable to those trained on human annotations while requiring only $3$-$20\%$ of the human labeling effort. Hence, SPAM paves the way towards highly efficient labeling of large-scale tracking datasets. We release all models and code.
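
To make the first insight concrete, the following is a minimal sketch of how a labeling budget might be split between pseudo-labels and human review. It is not SPAM's implementation: the function names (`binary_entropy`, `split_by_uncertainty`), the use of binary entropy as the uncertainty measure, and the 0.5 decision threshold are all assumptions chosen for illustration. Each candidate decision (for example, whether two detections belong to the same identity) is assumed to come with a model probability.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; 0 when the model is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def split_by_uncertainty(scores: dict, budget_fraction: float):
    """Split decisions into a human-review queue and automatic pseudo-labels.

    scores: hypothetical mapping from a decision id to the model's probability
            that the decision is positive.
    budget_fraction: share of decisions a human is allowed to review.
    """
    n_manual = math.ceil(budget_fraction * len(scores))
    # Most uncertain decisions first (probability closest to 0.5).
    ranked = sorted(scores, key=lambda k: binary_entropy(scores[k]), reverse=True)
    to_annotate = ranked[:n_manual]
    # Everything else is accepted as a pseudo-label by thresholding at 0.5 (assumed).
    auto_labels = {k: scores[k] >= 0.5 for k in ranked[n_manual:]}
    return to_annotate, auto_labels


if __name__ == "__main__":
    example_scores = {"a": 0.99, "b": 0.52, "c": 0.03, "d": 0.40}
    queue, pseudo = split_by_uncertainty(example_scores, budget_fraction=0.25)
    print("send to annotator:", queue)
    print("pseudo-labels:", pseudo)
```

With a 25% budget over four decisions, only the single most ambiguous decision is routed to the annotator, and the remaining three are labeled automatically.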
