MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking
Chang Nie, Yiqing Xu, Guangming Wang, Zhe Liu, Yanzi Miao, Hesheng Wang
TL;DR
MovSAM introduces a pioneering single-image moving object segmentation framework that leverages a Multimodal Large Language Model with Chain-of-Thought prompting to describe the moving object, and cross-fuses this textual guidance with segmentation-ready visual features from the Segment Anything Model and a Vision-Language Model. A deep-thinking loop enables iterative refinement of prompts and segmentation, mitigating errors due to occlusion, motion blur, or ambiguous cues in a single image. The system achieves state-of-the-art performance on MOS benchmarks (e.g., $\mathcal{J}\&\mathcal{F} = 92.5\%$ on DAVIS2016) while operating without temporal information, and demonstrates practical viability in real-world autonomous driving scenarios. MovSAM’s cross-modal fusion and reasoning-based refinement offer a new paradigm for robust scene understanding when multi-frame cues are unavailable, potentially enabling single-image optical flow estimation and other downstream tasks in robotics and vision. The approach highlights the importance of integrating deep-thinking capabilities of large models with structured segmentation pipelines for complex visual reasoning.
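For readers unfamiliar with the metric: on DAVIS-style benchmarks, $\mathcal{J}$ is the region similarity (intersection over union of the predicted and ground-truth masks), $\mathcal{F}$ is a boundary F-measure, and $\mathcal{J}\&\mathcal{F}$ is their mean. The sketch below computes $\mathcal{J}$ for binary masks and averages it with a given $\mathcal{F}$ value. The function names are illustrative and do not come from the MovSAM code, and the boundary measure itself is omitted because it requires contour matching.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are equally sized nested lists of 0/1 values.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and boundary accuracy."""
    return (j + f) / 2.0
```

Benchmark scores such as the one quoted above are typically these per-frame values averaged over a dataset, so a single image pair only illustrates the definition.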
Abstract
Moving object segmentation (MOS) plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications such as motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods because temporal cues are absent. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and, through deep thinking, generate text prompts for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then pass through a deep-thinking refinement loop, which lets MovSAM iteratively improve its understanding of the scene context and inter-object relationships through logical reasoning. In this way, MovSAM segments moving objects in single images by drawing on scene understanding in place of temporal cues. We deploy MovSAM in the real world to validate its practical effectiveness in autonomous driving scenarios where multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching a $\mathcal{J}\&\mathcal{F}$ score of 92.5\%. Our implementation will be available at https://github.com/IRMVLab/MovSAM.
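The control flow described in the abstract (describe the moving object, segment it from that description, then review and retry) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `describe`, `segment`, and `review` are hypothetical callables standing in for the MLLM prompt generator, the SAM/VLM segmentation stage, and the deep-thinking check, and the `max_rounds` limit is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    prompt: str      # text description of the moving object
    mask: object     # segmentation produced from that prompt
    accepted: bool   # whether the review stage accepted the mask


def refine(
    image: object,
    describe: Callable[[object, List[Step]], str],
    segment: Callable[[object, str], object],
    review: Callable[[object, str, object], bool],
    max_rounds: int = 3,
) -> List[Step]:
    """Run describe -> segment -> review until accepted or out of rounds."""
    history: List[Step] = []
    for _ in range(max_rounds):
        # Hypothetical MLLM stage: sees the earlier attempts so it can revise.
        prompt = describe(image, history)
        # Hypothetical segmentation stage driven by the text prompt.
        mask = segment(image, prompt)
        # Hypothetical reasoning stage that judges the result.
        accepted = review(image, prompt, mask)
        history.append(Step(prompt, mask, accepted))
        if accepted:
            break
    return history
```

Passing the full history back into `describe` is one simple way to let each round react to earlier failures. The last entry of the returned list holds the final prompt and mask, whether or not it was accepted.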
