Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs

Mattia Segu, Luigi Piccinelli, Siyuan Li, Luc Van Gool, Fisher Yu, Bernt Schiele

TL;DR

Walker is the first self-supervised tracker that learns from videos with sparse bounding box annotations and no tracking labels, and the first to achieve competitive performance on MOT17, DanceTrack, and BDD100K.

Abstract

The supervision of state-of-the-art multiple object tracking (MOT) methods requires enormous annotation efforts to provide bounding boxes for all frames of all videos, and instance IDs to associate them through time. To this end, we introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations, and no tracking labels. First, we design a quasi-dense temporal object appearance graph, and propose a novel multi-positive contrastive objective to optimize random walks on the graph and learn instance similarities. Then, we introduce an algorithm to enforce mutually-exclusive connective properties across instances in the graph, optimizing the learned topology for MOT. At inference time, we propose to associate detected instances to tracklets based on the max-likelihood transition state under motion-constrained bi-directional walks. Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K. Remarkably, our proposal outperforms the previous self-supervised trackers even when drastically reducing the annotation requirements by up to 400x.
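As a rough illustration of what a random walk on an appearance graph can look like, the sketch below builds transition probabilities between instance embeddings of two frames and composes them into a cycle walk that returns to the starting frame. It rests on assumptions not stated in the abstract: transitions are modeled as a temperature-scaled softmax over cosine similarities (a common choice in contrastive random-walk methods), and the names `transition_matrix`, `cycle_walk`, and `temperature` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transition_matrix(src, dst, temperature=0.1):
    # Assumption: row-stochastic transitions from a softmax over cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    dst = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    return softmax(src @ dst.T / temperature, axis=1)


def cycle_walk(q_t, q_tk, temperature=0.1):
    # Walk from frame t to frame t+k and back to frame t.
    forward = transition_matrix(q_t, q_tk, temperature)   # shape (N_t, N_tk)
    backward = transition_matrix(q_tk, q_t, temperature)  # shape (N_tk, N_t)
    return forward @ backward                             # shape (N_t, N_t)


# Toy data: 4 instances in frame t; frame t+k holds noisy copies plus one distractor.
rng = np.random.default_rng(0)
q_t = np.eye(4, 16)
distractor = np.eye(16)[10:11]
q_tk = np.vstack([q_t + 0.05 * rng.normal(size=(4, 16)), distractor])

cycle = cycle_walk(q_t, q_tk)
print(cycle.round(3))
```

In this toy setup each row of `cycle` is a probability distribution over the instances of frame t, and a cycle-consistency objective would push the mass of row i toward the node (or nodes) that correspond to instance i.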

Paper Structure

This paper contains 29 sections, 1 theorem, 11 equations, 11 figures, 15 tables, 3 algorithms.

Key Result

Theorem

The probability of transitioning on a latent node $\textbf{q}_{t+k}^j$ on the reference image $I_{t+k}$ when starting from $\textbf{q}_{t^+}^i$ in $I_t$ and ending on $\textbf{q}_{t}^l$ in $I_t$ along the cycle walk $\mathcal{G}$ is:

$$p^{\mathcal{G}}_{X_{t+k} | X_t^+, X_t}(j|i,l) = \frac{p^{\mathcal{G}}_{X_t| X_{t+k}}(l|j) \, p^{\mathcal{G}}_{X_{t+k} | X_t^+}(j|i)}{C},$$

where $C = \sum_{\textbf{q}_{t+k}^m \in \textbf{Q}_{t+k}} p^{\mathcal{G}}_{X_t| X_{t+k}}(l|m) p^{\mathcal{G}}_{X_{t+k} | X_t^+}(m|i)$ is a normalizing constant.
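The normalizing constant $C$ sums, over all intermediate nodes $m$, the product of a forward transition $i \to m$ and a backward transition $m \to l$. Read as Bayes' rule on a two-step Markov walk, the probability of passing through node $j$ would then be the $j$-th summand divided by $C$. The sketch below works this out numerically under that reading, which is inferred from the definition of $C$ rather than quoted from the paper; the function name `intermediate_posterior` and the toy matrices are hypothetical.

```python
import numpy as np


def intermediate_posterior(p_fwd, p_bwd, i, l):
    """Posterior over the intermediate node of a two-step walk i -> m -> l.

    p_fwd[i, m] plays the role of p(X_{t+k} = m | X_t^+ = i).
    p_bwd[m, l] plays the role of p(X_t = l | X_{t+k} = m).
    """
    joint = p_fwd[i, :] * p_bwd[:, l]  # one summand of C per intermediate node m
    C = joint.sum()                    # normalizing constant
    return joint / C


# Toy row-stochastic transition matrices with two nodes per frame.
p_fwd = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
p_bwd = np.array([[0.9, 0.1],
                  [0.4, 0.6]])

posterior = intermediate_posterior(p_fwd, p_bwd, i=0, l=0)
print(posterior)  # [0.9 0.1]
```

Here the summands are 0.8 · 0.9 = 0.72 and 0.2 · 0.4 = 0.08, so $C = 0.8$ and the walk that starts and ends on node 0 passes through intermediate node 0 with probability 0.9.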

Figures (11)

  • Figure 1: Supervised MOT requires dense tracking labels (top), i.e. dense detection annotations at each frame and instance labels (shown by coloring boxes by instance ID) across frames. Self-supervised Re-ID assumes dense detection labels and no instance labels (middle). We explore self-supervised MOT in a more practical sparsely-annotated setting (bottom), with sparse detection annotations every $k$ frames (here $k=3$ for illustration purposes) and no instance labels. Fully-unlabeled frames are shown in green.
  • Figure 2: Multi-positive Cycle Consistency. Illustration of the proposed multi-positive cycle consistency on the quasi-dense temporal object appearance graph (TOAG). We show the cycle walk departing from a given query node (yellow). The multiple positive nodes are in green and the negative nodes in red. For ease of visualization, we only show the high-likelihood transitions.
  • Figure 3: Cluster-wise Forward Assignment. Illustration of the positive (green) and negative (red) forward pseudo-labels for an input query cluster (yellow), derived from our cluster-wise forward assignment strategy.
  • Figure 4: Self-supervised MOT under different annotation sparsity rates (FPS) during training. We compare video-level (Walker; QD-Walker) and frame-level (QDTrack-S) self-supervision. $\dagger$: reference QDTrack fully-supervised at 20 FPS.
  • Figure 5: We analyze 5 frames spaced by 0.2 seconds of the DanceTrack sequence 0058. Compared to image-level self-supervision (QDTrack-S, Fischer et al. 2022), Walker effectively utilizes the temporal information to reduce ID switches (blue). Correctly tracked boxes are shown in green.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Theorem
  • Proof of the theorem on transition probabilities