Segment Any Motion in Videos

Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang

TL;DR

This work tackles moving object segmentation (MOS) with a pipeline that combines long-range trajectory motion cues, self-supervised DINO semantic features, and SAM2-based mask densification. The approach introduces Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to identify dynamic trajectories, which are then densified into pixel-level masks via iterative SAM2 prompting. Experiments on DAVIS, FBMS-59, and SegTrack v2 demonstrate state-of-the-art results, with clear advantages in fine-grained, per-object segmentation and robustness to challenging conditions such as drastic camera motion and complex deformations. The method shows strong generalization, prompt-based mask refinement, and practical potential for applications requiring precise moving-object masks.

Abstract

Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
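
The abstract does not spell out the iterative prompting strategy, so the following is only a minimal sketch of one way such a loop could group dynamic track points into per-object masks. It assumes a generic point-prompted segmenter: `segment_fn`, `toy_segment_fn`, and `group_tracks_by_prompting` are hypothetical names introduced for illustration and are not the SAM2 API or the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]   # (row, col) location of a dynamic track in one frame
Mask = List[List[bool]]   # dense binary mask, indexed as mask[row][col]


def group_tracks_by_prompting(
    points: Sequence[Point],
    segment_fn: Callable[[Point], Mask],
    max_objects: int = 10,
) -> Tuple[List[int], List[Mask]]:
    """Give each dynamic point an object id by repeatedly prompting a segmenter.

    segment_fn is a hypothetical stand-in for a point-prompted model such as SAM2.
    """
    labels = [-1] * len(points)  # -1 means "not yet covered by any mask"
    masks: List[Mask] = []
    for _ in range(max_objects):
        unassigned = [i for i, lab in enumerate(labels) if lab == -1]
        if not unassigned:
            break
        seed = unassigned[0]
        mask = segment_fn(points[seed])  # prompt with one uncovered point
        obj_id = len(masks)
        masks.append(mask)
        # Every still-unassigned point inside the new mask joins this object.
        for i in unassigned:
            row, col = points[i]
            if mask[row][col]:
                labels[i] = obj_id
        labels[seed] = obj_id  # guarantees progress even if the mask misses the seed
    return labels, masks


def toy_segment_fn(point: Point) -> Mask:
    """Toy segmenter on a 6x8 grid with two rectangular 'objects' (assumed data)."""
    boxes = [(0, 0, 2, 3), (3, 4, 5, 7)]  # (row0, col0, row1, col1), inclusive
    row, col = point
    for r0, c0, r1, c1 in boxes:
        if r0 <= row <= r1 and c0 <= col <= c1:
            return [[r0 <= y <= r1 and c0 <= x <= c1 for x in range(8)]
                    for y in range(6)]
    # Point lies on no object: return a mask covering only that pixel.
    return [[(y, x) == (row, col) for x in range(8)] for y in range(6)]


points = [(1, 1), (4, 5), (2, 3), (5, 7)]
labels, masks = group_tracks_by_prompting(points, toy_segment_fn)
print(labels)      # [0, 1, 0, 1]
print(len(masks))  # 2
```

In this sketch the sparse points carry the "what is moving" decision and the segmenter only supplies object extent, which mirrors the division of labor described above; a real system would also need to propagate masks across frames and handle noisy points.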

Paper Structure

This paper contains 24 sections, 1 equation, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Our method handles challenging scenarios, including articulated structures, shadow reflections, dynamic background motion, and drastic camera movements, while producing fine-grained, per-object moving object masks.
  • Figure 2: The effectiveness of long-range tracks. Over longer periods of time, if a moving object experiences factors such as occlusion or changes in lighting, it can negatively affect the tracking performance of optical-flow-based methods for that object.
  • Figure 3: Overview of our pipeline. We take as input 2D tracks and depth maps generated by off-the-shelf models [doersch2024bootstap, depthanything], which are processed by a motion encoder to capture motion patterns, producing featured tracks. Next, a track decoder that integrates DINO features [oquab2023dinov2] decodes the featured tracks by decoupling motion and semantic information, ultimately yielding the dynamic trajectories (a). Finally, using SAM2 [ravi2024sam2], we group dynamic tracks belonging to the same object and generate fine-grained moving object masks (b).
  • Figure 4: Qualitative comparison on DAVIS17-moving benchmarks. For each sequence we show moving object mask results. Our method successfully handles water reflections (left), camouflage appearances (middle), and drastic camera motion (right).
  • Figure 5: Qualitative comparison on FBMS-59 benchmarks. Our masks are geometrically more complete and detailed.
  • ...and 8 more figures