Segment Any Motion in Videos
Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
TL;DR
This work tackles moving object segmentation (MOS) by combining long-range trajectory motion cues, self-supervised DINO semantic features, and SAM2-based mask densification. The approach introduces Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to identify dynamic trajectories, which are then densified into pixel-level masks via iterative SAM2 prompting. Experiments on DAVIS, FBMS-59, and SegTrack v2 show state-of-the-art results, with clear advantages in fine-grained, per-object segmentation and robustness to challenging conditions such as drastic camera motion and complex deformations. The method generalizes well, and its prompt-based refinement makes it practical for applications that require precise moving-object masks.
Abstract
Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
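The idea of densifying sparse dynamic points into masks through repeated prompting can be illustrated with a toy sketch. This is not the authors' procedure and does not call SAM2: `toy_segmenter` is a hypothetical stand-in that returns a disk around the prompt point, and `densify`, `radius`, and the "drop points already covered by a mask" rule are assumptions made purely for illustration.

```python
import numpy as np


def toy_segmenter(point, height, width, radius=3):
    """Hypothetical stand-in for a promptable segmenter (NOT the SAM2 API).

    Returns a boolean disk mask centred on the (x, y) prompt point.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    x, y = point
    return (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2


def densify(points, height, width, segment_fn=toy_segmenter):
    """Turn sparse dynamic points (x, y) into masks by repeated prompting.

    Assumed loop: prompt with one remaining point, keep the returned mask,
    then discard every point that mask already covers.
    """
    remaining = [tuple(p) for p in points]
    masks = []
    while remaining:
        prompt = remaining[0]
        mask = segment_fn(prompt, height, width)
        masks.append(mask)
        # Always drop the prompt itself so the loop terminates, then drop
        # any other point that falls inside the new mask (indexed as [y, x]).
        remaining = [(x, y) for (x, y) in remaining[1:] if not mask[y, x]]
    return masks


# Two nearby points share one mask; the distant point gets its own.
masks = densify([(5, 5), (6, 5), (20, 20)], height=32, width=32)
print(len(masks))
```

In a real pipeline the stand-in would be replaced by an actual promptable segmentation model, and the points would come from trajectories classified as dynamic rather than a hand-written list.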
