MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking

Chang Nie, Yiqing Xu, Guangming Wang, Zhe Liu, Yanzi Miao, Hesheng Wang

TL;DR

MovSAM introduces a pioneering single-image moving object segmentation framework that leverages a Multimodal Large Language Model with Chain-of-Thought prompting to describe the moving object, and cross-fuses this textual guidance with segmentation-ready visual features from the Segment Anything Model and a Vision-Language Model. A deep-thinking loop enables iterative refinement of prompts and segmentation, mitigating errors due to occlusion, motion blur, or ambiguous cues in a single image. The system achieves state-of-the-art performance on MOS benchmarks (e.g., $\mathcal{J}\&\mathcal{F} = 92.5\%$ on DAVIS2016) while operating without temporal information, and demonstrates practical viability in real-world autonomous driving scenarios. MovSAM’s cross-modal fusion and reasoning-based refinement offer a new paradigm for robust scene understanding when multi-frame cues are unavailable, potentially enabling single-image optical flow estimation and other downstream tasks in robotics and vision. The approach highlights the importance of integrating deep-thinking capabilities of large models with structured segmentation pipelines for complex visual reasoning.

Abstract

Moving object segmentation (MOS) plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This approach enables MovSAM to segment moving objects in single images through scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on $\mathcal{J}\&\mathcal{F}$. Our implementation will be available at https://github.com/IRMVLab/MovSAM.
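
For readers unfamiliar with the reported metric: in the DAVIS evaluation protocol, $\mathcal{J}$ is the region similarity (intersection over union of the predicted and ground-truth masks), $\mathcal{F}$ is a boundary F-measure, and $\mathcal{J}\&\mathcal{F}$ is their mean. The sketch below is illustrative only and is not taken from the MovSAM code. It computes $\mathcal{J}$ on small binary masks and takes $\mathcal{F}$ as a given number, since the boundary matching behind $\mathcal{F}$ is more involved. The function names and the empty-mask convention are assumptions made for this example.

```python
from typing import Sequence


def region_similarity(pred: Sequence[Sequence[int]],
                      gt: Sequence[Sequence[int]]) -> float:
    """J: intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention for this sketch: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and boundary F-measure."""
    return (j + f) / 2.0


# Toy 2x2 masks: one overlapping pixel, two pixels in the union.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
j_score = region_similarity(pred_mask, gt_mask)
```

Benchmark scores average these per-frame values over sequences, so a single number such as 92.5% summarizes both region overlap and contour quality.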

Paper Structure

This paper contains 16 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of how MovSAM uses deep thinking to segment the moving object from a single image. MovSAM begins with the MLLM searching for the moving object to generate a descriptive text prompt. Subsequently, the text prompt and the image enter a thinking loop: the moving object is first segmented by SAM and the VLM in the segmentation module. Next, the MLLM thinks deeply about the segmentation result. If the segmentation is incorrect, the text prompt is regenerated. If the segmentation is correct, the loop breaks and outputs the final segmented moving object.
  • Figure 2: The pipeline of the proposed MovSAM. First, the Multimodal Large Language Model (MLLM) generates a text prompt with the moving object CoT, describing the moving object in the image (Sec. \ref{init}). Subsequently, in the moving object segmentation module, MovSAM cross-fuses the features of SAM and the VLM to logically reason about the image and segment the moving object (Sec. \ref{segment}). The segmentation is then evaluated by the MLLM in a deep thinking loop (Sec. \ref{evaluate}). If the segmentation is incorrect, a new text prompt is generated and segmentation is run again. If it is correct, the MLLM explains the reasons for the movement. Through this closed-loop framework of guidance, segmentation, and deep thinking, MovSAM achieves single-image moving object segmentation based on scene understanding and logical reasoning.
  • Figure 3: Real-world driving scenes. Sensor failures leading to dropped frames and pedestrian gaming scenes are very difficult for multi-image methods. In contrast, MovSAM based on deep thinking can accurately segment the moving object.
  • Figure 4: The qualitative results of TMO and MovSAM on various MOS datasets. The major errors are indicated by red circles. Motion blur can be addressed by deep thinking.
  • Figure 5: The sequence results of MovSAM on bmx-trees in the occluded scene. MovSAM can still segment the object under occlusion for each image of the sequence.
  • ...and 2 more figures
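
The closed loop described in the captions of Figures 1 and 2 (propose a prompt, segment, evaluate, regenerate the prompt on failure) can be sketched roughly as follows. This is a minimal control-flow illustration under stated assumptions, not the authors' implementation. The three callables stand in for the MLLM prompt generator, the SAM and VLM segmentation module, and the MLLM evaluator. Their names, signatures, the feedback string, and the `max_iters` cap are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class LoopResult:
    mask: Any          # last segmentation produced
    prompt: str        # text prompt that produced it
    iterations: int    # number of segment/evaluate rounds that were run
    accepted: bool     # True if the evaluator judged the mask correct


def thinking_loop(
    image: Any,
    propose_prompt: Callable[[Any, Optional[str]], str],    # stand-in for the MLLM with CoT
    segment: Callable[[Any, str], Any],                     # stand-in for the SAM + VLM module
    evaluate: Callable[[Any, Any, str], Tuple[bool, str]],  # stand-in for the MLLM evaluator
    max_iters: int = 3,                                     # assumed cap, not from the paper
) -> LoopResult:
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    feedback: Optional[str] = None
    for iteration in range(1, max_iters + 1):
        # First pass: no feedback. Later passes: regenerate using the evaluator's feedback.
        prompt = propose_prompt(image, feedback)
        mask = segment(image, prompt)
        correct, feedback = evaluate(image, mask, prompt)
        if correct:
            return LoopResult(mask, prompt, iteration, True)
    # Budget exhausted: return the last attempt, flagged as not accepted.
    return LoopResult(mask, prompt, max_iters, False)


# Toy stand-ins so the control flow can be exercised without any model.
def toy_propose(image: Any, feedback: Optional[str]) -> str:
    return "the cyclist" if feedback else "the parked car"


def toy_segment(image: Any, prompt: str) -> dict:
    return {"label": prompt}


def toy_evaluate(image: Any, mask: dict, prompt: str) -> Tuple[bool, str]:
    correct = mask["label"] == "the cyclist"
    return correct, "" if correct else "the car is parked; look for an object in motion"


result = thinking_loop("image.png", toy_propose, toy_segment, toy_evaluate)
```

With the toy stand-ins, the first prompt is rejected and the second is accepted, so the loop exits after two rounds. Passing the models in as callables keeps the loop logic independent of any particular MLLM or segmentation backend.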