Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao

TL;DR

This work shows that pretrained video diffusion models encode motion representations in their high-noise denoising stages, and that these representations are highly effective for tracking visually similar objects without supervision. The authors extract motion cues (R_m), combine them with appearance cues (R_a) into a fused representation (R_f), and apply label propagation, achieving state-of-the-art pixel-level tracking on DAVIS-2017 and substantially better performance on datasets with similar-looking objects. The resulting tracker, TED (Temporal Enhanced Diffusion tracking), demonstrates that motion understanding can emerge from generative models without dedicated tracking training, highlighting a new use case for diffusion models and suggesting avenues for efficiency improvements. The findings broaden the scope of diffusion-model applications beyond generation to robust, self-supervised perception tasks.
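
To make the idea of a fused representation concrete, here is a minimal sketch. The summary does not say how R_m and R_a are combined, so everything below is an assumption for illustration only: the function names (`l2_normalize`, `fuse`), the weighted-concatenation scheme, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def fuse(r_m, r_a, weight=0.5):
    """Hypothetical fusion: weighted concatenation of normalized features.

    r_m    -- motion feature of one pixel (list of floats)
    r_a    -- appearance feature of the same pixel (list of floats)
    weight -- assumed mixing weight in [0, 1]; larger favors motion
    """
    motion = [weight * x for x in l2_normalize(r_m)]
    appearance = [(1.0 - weight) * x for x in l2_normalize(r_a)]
    return motion + appearance


def dot(u, v):
    """Plain dot product, used here as the similarity between fused features."""
    return sum(a * b for a, b in zip(u, v))


# Two pixels with identical appearance but different motion.
pixel_a = fuse(r_m=[1.0, 0.0], r_a=[0.6, 0.8])
pixel_b = fuse(r_m=[0.0, 1.0], r_a=[0.6, 0.8])

print("self similarity :", dot(pixel_a, pixel_a))
print("cross similarity:", dot(pixel_a, pixel_b))
```

Under this assumed scheme, two pixels that look identical still receive a lower similarity score when their motion features differ, which is the property that label propagation needs in order to keep similar-looking objects apart.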

Abstract

Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Video label propagation on similar-looking objects. State-of-the-art self-supervised trackers, such as DIFT [tang2023emergent], CRW [jabri2020space], and Spa-then-Temp [li2023spatial], often struggle when multiple objects in a video look similar. This failure stems from their exclusive reliance on appearance features. By dissecting and repurposing pretrained video diffusion models, we construct a feature that captures inter-frame motion in videos, allowing us to correctly track similar-looking objects, such as the deer highlighted by the green box in (c). In this figure, the green and red masks represent segmentation maps of different objects, while the blue, green, and red boxes highlight the ground-truth regions, correctly predicted regions, and incorrectly predicted regions, respectively.
  • Figure 2: Our approach successfully tracks objects with identical appearances. In a controlled study, we perform object label propagation on videos featuring two identical-looking, independently moving balls, with frames and their ground-truth labels shown in (a) and (b). State-of-the-art methods [jabri2020space, li2023spatial, tang2023emergent] fail to distinguish the two balls, leading to incorrect predictions (c). In contrast, our approach accurately tracks both balls despite their identical appearance (d).
  • Figure 3: Framework. Our work tracks objects via video label propagation, which transfers the ground-truth labels of the first frame to subsequent frames. As video diffusion models typically have a maximum input length, we first divide the long video into overlapping video windows (see (a)). For each window, we use video diffusion models to extract frame representations that capture rich inter-frame motion features (see (b)). Specifically, our method uses a 3D UNet backbone that can process the entire video sequence along the temporal axis. Finally, to predict the label for a query pixel $i$ in the target frame (${\bf R}^t$), we follow prior studies and aggregate the labels of its most similar pixels in previous frames (see (c); details in Section \ref{subsec:tracking_steps}). We term our method the Temporal Enhanced Diffusion tracking framework (TED). Experiments demonstrate that TED improves tracking performance across diverse video scenarios, including those with similar-looking objects.
  • Figure 4: Predictions for pixel-level object tracking. We evaluate TED on the video label propagation task, comparing its predicted segmentation maps with those from state-of-the-art methods [li2023spatial, tang2023emergent]. TED consistently outperforms both methods on the DAVIS (panels a-d) and YouTube-Similar (panels e-f) datasets, in line with Table \ref{tab:main_results}. Notably, TED delivers more accurate predictions in scenarios with complex deformations (a) and viewpoint changes (b), while Spa-then-Temp [li2023spatial] and DIFT [tang2023emergent] struggle with tracking completeness, e.g., the missing arm in (a). TED also achieves superior tracking in multi-object scenarios, such as interacting objects (c-d) and similar-looking objects (e-f). In contrast, Spa-then-Temp [li2023spatial] and DIFT [tang2023emergent] suffer from mislabeling, such as incorrect labels for the gun in (d) and misaligned labels for the sheep in the background (f). These results show that TED significantly improves tracking performance, highlighting the advantage of our motion-aware representations. (Best viewed when zoomed in.)
  • Figure 5: Tracking results under different denoising steps. We evaluate tracking performance using model inputs ${\bf X}^\tau$ at various denoising steps $\tau$, where a larger $\tau$ indicates more noise (see (b)). The performance of the appearance features ${\bf R}_a$ degrades significantly as $\tau$ increases, while our motion feature ${\bf R}_m$ maintains high tracking accuracy even at large $\tau$. Notably, ${\bf R}_m$ peaks at $\tau$=600 on YouTube-Similar and $\tau$=900 on Kubric-Similar, where appearance cues are almost unavailable. These results reveal that video diffusion models can learn object motions from highly noisy inputs, enabling effective, motion-aware tracking.
  • ...and 2 more figures
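
The windowing and label-aggregation steps described in the Figure 3 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`split_into_windows`, `propagate_labels`), the window length, the overlap, the top-k value, and the softmax temperature are all hypothetical, and the features are toy vectors rather than outputs of a video diffusion model.

```python
import math


def split_into_windows(num_frames, window, overlap):
    """Return (start, end) frame ranges of overlapping windows (end exclusive)."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    windows, start = [], 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            return windows
        start += stride


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def propagate_labels(query_feats, ref_feats, ref_labels, k=2, temperature=0.1):
    """Label each query pixel from its k most similar reference pixels.

    query_feats -- features of pixels in the target frame
    ref_feats   -- features of pixels in previous (already labeled) frames
    ref_labels  -- one label distribution (e.g. one-hot) per reference pixel
    """
    num_classes = len(ref_labels[0])
    predictions = []
    for query in query_feats:
        sims = [cosine(query, ref) for ref in ref_feats]
        top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
        # Softmax over the top-k similarities (assumed weighting scheme).
        peak = max(sims[j] for j in top)
        exps = [math.exp((sims[j] - peak) / temperature) for j in top]
        total = sum(exps)
        label = [0.0] * num_classes
        for weight, j in zip(exps, top):
            for c in range(num_classes):
                label[c] += (weight / total) * ref_labels[j][c]
        predictions.append(label)
    return predictions


# Toy example: four labeled reference pixels, two query pixels.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
query_feats = [[1.0, 0.05], [0.05, 1.0]]

predictions = propagate_labels(query_feats, ref_feats, ref_labels)
print(split_into_windows(num_frames=10, window=4, overlap=2))
print([p.index(max(p)) for p in predictions])
```

In a real pipeline the feature vectors would come from the diffusion backbone, and the reference set would typically be restricted to a few recent frames plus the first frame. Those choices are left out here because the captions do not specify them.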