SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow

Orcun Cetintas, Tim Meinhardt, Guillem Brasó, Laura Leal-Taixé

TL;DR

The paper tackles the high cost of annotating long multi-object tracking sequences by introducing SPAM, a video label engine that combines synthetic pre-training on MOTSynth, pseudo-labeling on real data, and active learning guided by a hierarchical graph model, so that detections and associations are labeled with minimal human input. Its detection-first, graph-based labeling framework uses temporal dependencies to propagate decisions across time, while uncertainty-driven annotator intervention is reserved for the hardest cases. Empirically, SPAM reaches near-ground-truth tracking performance with as little as 3.3% of the manual annotations on MOT17 (and with low budgets on MOT20 and DanceTrack), and retraining trackers on SPAM-generated pseudo-labels yields substantial gains at no manual labeling cost. Overall, SPAM demonstrates that synthetic pre-training, self-training with pseudo-labels, and hierarchical graph reasoning can dramatically reduce labeling costs and scale up tracking datasets for data-hungry trackers. Models and code are open source.

Abstract

Increasing the annotation efficiency of trajectory annotations from videos has the potential to enable the next generation of data-hungry tracking algorithms to thrive on large-scale datasets. Despite the importance of this task, there are currently very few works exploring how to efficiently label tracking datasets comprehensively. In this work, we introduce SPAM, a video label engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: i) most tracking scenarios can be easily resolved. To take advantage of this, we utilize a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances; ii) handling the spatiotemporal dependencies of track annotations across time can be elegantly and efficiently formulated through graphs. Therefore, we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations with a fraction of ground truth labeling cost. We demonstrate that trackers trained on SPAM labels achieve comparable performance to those trained on human annotations while requiring only 3-20% of the human labeling effort. Hence, SPAM paves the way towards highly efficient labeling of large-scale tracking datasets. We release all models and code.
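
The first insight (easy cases go to the model, hard cases to a human) can be illustrated with a minimal sketch of uncertainty-based selection. It assumes the model outputs one probability per decision, for example whether a detection is a valid object. It uses binary entropy as the uncertainty measure and sends only the most uncertain fraction of decisions to an annotator. The remaining decisions are accepted as pseudo-labels by thresholding at 0.5. The function names, the entropy criterion, and the threshold are illustrative assumptions, not the paper's exact procedure.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def split_by_uncertainty(probs, budget):
    """Split decision indices into (to_annotate, pseudo_labels).

    probs  : list of model probabilities, one per decision (hypothetical input)
    budget : fraction of decisions a human is allowed to label, in [0, 1]
    """
    n_manual = int(round(budget * len(probs)))
    # Rank decisions from most to least uncertain.
    ranked = sorted(
        range(len(probs)),
        key=lambda i: binary_entropy(probs[i]),
        reverse=True,
    )
    to_annotate = sorted(ranked[:n_manual])
    # Everything else is accepted as a pseudo-label (assumed 0.5 threshold).
    pseudo_labels = {i: probs[i] >= 0.5 for i in ranked[n_manual:]}
    return to_annotate, pseudo_labels


# Toy probabilities standing in for model outputs.
probs = [0.98, 0.52, 0.03, 0.45, 0.91, 0.10]
to_annotate, pseudo_labels = split_by_uncertainty(probs, budget=1 / 3)
print(to_annotate)    # indices routed to the human annotator
print(pseudo_labels)  # index -> accepted pseudo-label
```

With this toy input, the two decisions closest to 0.5 are routed to the annotator and the other four are labeled automatically. In practice the budget would be chosen per dataset.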

Paper Structure (22 sections, 1 equation, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Overview of the SPAM model. We first generate a set of detection candidates with our detector. Hierarchical GNNs then classify these candidates into valid and invalid objects via node classification, and assign identities through edge classification.
  • Figure 2: Overview of the SPAM training and annotation pipeline. (a) Initial model training on synthetic data. (b) Application of SPAM to generate pseudo-labels without incurring manual annotation costs on a real dataset, followed by self-training on pseudo-labels. (c) Real dataset labeling using pseudo-labels and an uncertainty-based active learning approach.
  • Figure 3: Our graph-based labeling pipeline begins with the selection of nodes for annotation. For each node to be annotated, the annotator could be asked to validate the detection, improve the localization by refining the box or perform association.
  • Figure 3: Performance boost obtained by our model when retraining with its own pseudo-labels incurring no manual annotation cost.
  • Figure 4: Analysis of performance gap between training a model on synthetic and real data for the three most common tracking components: detection, association, re-identification.
  • ...and 3 more figures
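
The graph formulation described in the Figure 1 caption can be sketched as follows: detections are nodes, candidate associations are edges, node classification filters out invalid detections, and edge classification merges the remaining detections into identities. This is a simplified, flat (non-hierarchical) sketch in which hard-coded scores stand in for GNN outputs. The `build_tracks` helper, the union-find merging, and the 0.5 thresholds are assumptions made for illustration, not the authors' implementation.

```python
def build_tracks(node_scores, edge_scores, node_thr=0.5, edge_thr=0.5):
    """Group valid detections into tracks from node and edge scores.

    node_scores : {det_id: probability that the detection is a valid object}
    edge_scores : {(det_a, det_b): probability that both share an identity}
    """
    # Node classification: keep only detections judged valid.
    valid = {d for d, s in node_scores.items() if s >= node_thr}
    parent = {d: d for d in valid}

    def find(d):
        # Union-find root lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    # Edge classification: merge endpoints of edges judged active.
    for (a, b), s in edge_scores.items():
        if s >= edge_thr and a in valid and b in valid:
            parent[find(a)] = find(b)

    # Collect detections by their root to form tracks.
    tracks = {}
    for d in valid:
        tracks.setdefault(find(d), []).append(d)
    return sorted(sorted(t) for t in tracks.values())


# Hypothetical detections named "<frame>_<letter>" with toy scores.
node_scores = {"f1_a": 0.95, "f1_b": 0.90, "f2_a": 0.97, "f2_b": 0.20, "f3_a": 0.88}
edge_scores = {
    ("f1_a", "f2_a"): 0.93,
    ("f1_b", "f2_b"): 0.80,
    ("f2_a", "f3_a"): 0.91,
    ("f1_b", "f3_a"): 0.05,
}
tracks = build_tracks(node_scores, edge_scores)
print(tracks)
```

Here the low-scoring detection `f2_b` is rejected at the node stage, so its edge is ignored even though the edge score is high. Note that plain union-find does not enforce the constraint that a track contains at most one detection per frame, so a real system would need an extra check for that.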