Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao
TL;DR
This work shows that pretrained video diffusion models encode motion representations in their early, high-noise denoising stages, and that these representations are highly effective for tracking visually similar objects without supervision. The authors extract motion cues (R_m), fuse them with appearance cues (R_a) into a single representation (R_f), and apply label propagation, achieving state-of-the-art pixel-level tracking on DAVIS-2017 and substantially better performance on datasets with similar-looking objects. The method, TED, demonstrates that motion understanding can emerge from generative models without dedicated tracking training; this points to a new use case for diffusion models and leaves room for efficiency improvements. The findings extend diffusion models beyond generation to robust, self-supervised perception tasks.
Abstract
Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.
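The pipeline summarized above (motion features R_m, appearance features R_a, a fused representation R_f, then label propagation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The fusion rule (weighted concatenation of L2-normalized features), the `weight`, `top_k`, and `temperature` parameters, and the function names `fuse` and `propagate_labels` are all assumptions made for illustration; the source does not specify them. The features are assumed to have already been extracted per pixel, and the toy example stands in for two identical-looking objects that differ only in motion.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse(r_m, r_a, weight=0.5):
    """Hypothetical fusion of motion (r_m) and appearance (r_a) features.

    Assumption: R_f is a weighted concatenation of the normalized inputs.
    The source only states that the two cues are combined.
    """
    return np.concatenate(
        [weight * l2_normalize(r_m), (1.0 - weight) * l2_normalize(r_a)],
        axis=-1,
    )


def propagate_labels(feat_ref, labels_ref, feat_tgt, top_k=5, temperature=0.1):
    """Propagate per-pixel labels from a reference frame to a target frame.

    feat_ref:   (N_ref, D) features of reference pixels
    labels_ref: (N_ref, C) one-hot or soft labels of reference pixels
    feat_tgt:   (N_tgt, D) features of target pixels
    Returns (N_tgt, C) soft labels: a softmax-weighted average over the
    top_k most similar reference pixels (cosine similarity).
    """
    sim = l2_normalize(feat_tgt) @ l2_normalize(feat_ref).T  # (N_tgt, N_ref)
    k = min(top_k, sim.shape[1])
    # Indices of the k most similar reference pixels for each target pixel.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, idx, axis=1)
    # Numerically stable softmax over the selected similarities.
    w = np.exp((top_sim - top_sim.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkc->nc", w, labels_ref[idx])


# Toy example: pixels 0-1 belong to object A, pixels 2-3 to object B.
# Appearance is identical for all pixels; only motion differs.
r_a = np.ones((4, 3))
r_m = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

r_f = fuse(r_m, r_a)
pred = propagate_labels(r_f, labels, r_f, top_k=2).argmax(axis=1)
print(pred)
```

In this toy setting, appearance features alone give every pair of pixels the same similarity, so the nearest-neighbor lookup cannot separate the two objects; the motion component of the fused features is what breaks the tie.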
