Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory

Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, Fanzhang Li

TL;DR

This work presents a training-free reasoning video segmentation framework that uses only pre-trained models and aims to outperform existing methods that require fine-tuning. It also proposes an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in each video sequence.

Abstract

Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods are coupled in the processing of spatio-temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.

Paper Structure (28 sections, 9 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of SDAM. (a) Some existing methods rely on image-level video understanding, establishing correspondence between text descriptions and video frames, while neglecting the inherent temporal information in video tasks. (b) Our training-free framework adaptively memorizes key object information based on motion cues in the video and the frame-level confidence jointly obtained from MLLM and SAM. Additionally, we decouple the spatio-temporal information in the video to enhance the temporal stability of the architecture during the segmentation process.
  • Figure 2: The overall pipeline of SDAM. Our method consists of two parts: (a) Spatio-temporal Decoupling (SD). We pass the keyframe candidates into the MLLM and SAM to obtain object information in the spatial domain, and then use the Object Tracker to propagate the key object across the temporal domain. (b) Adaptive Object Memory (AOM). We first use the Motion Driven Sampler to adaptively sample keyframe candidates based on motion cues, then use Joint Keyframe Selection to select the frame with the highest confidence as the keyframe, and store the key object memory in the Object Memory Bank.
  • Figure 3: Motion Driven Sampler (MDS). To obtain a keyframe candidate set with richer spatio-temporal information, we propose the MDS module, which adaptively selects frames with more significant scene changes as candidates based on the motion cues in the video sequence.
  • Figure 4: Visualization of temporal stability analysis. We analyze the temporal stability of SDAM, MUTR, and ReferFormer on the Ref-YouTube-VOS dataset.
  • Figure 5: Qualitative results on the RefVOS (a) and ReasonVOS (b) datasets. The time steps are directed from left to right. Zoom in for better viewing.
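
The captions for Figures 2 and 3 describe a two-step keyframe choice: sample candidate frames where the scene changes most, then keep the candidate with the highest confidence. The sketch below illustrates that idea only and is not the authors' implementation. It assumes that the motion cue is the mean absolute difference between consecutive frames, and that the joint confidence is the product of a per-frame MLLM score and a per-frame SAM score; neither choice is stated in the text above. The function names and the confidence dictionaries are hypothetical.

```python
import numpy as np


def motion_scores(frames: np.ndarray) -> np.ndarray:
    """Per-frame motion cue: mean absolute difference to the previous frame.

    Assumption: frame differencing stands in for the paper's motion cue.
    `frames` has shape (T, H, W) or (T, H, W, C).
    """
    frames = frames.astype(np.float32)
    diffs = np.abs(frames[1:] - frames[:-1])
    scores = diffs.reshape(len(frames) - 1, -1).mean(axis=1)
    # The first frame has no predecessor, so its score is set to zero.
    return np.concatenate([[0.0], scores])


def sample_candidates(frames: np.ndarray, k: int) -> list[int]:
    """Return the indices of the k frames with the largest motion score."""
    scores = motion_scores(frames)
    k = min(k, len(frames))
    top = np.argsort(-scores, kind="stable")[:k]
    return sorted(int(i) for i in top)


def select_keyframe(candidates, mllm_conf, sam_conf) -> int:
    """Pick the candidate with the highest joint confidence.

    Assumption: the joint score is the product of the two confidences.
    `mllm_conf` and `sam_conf` map a frame index to a score in [0, 1].
    """
    joint = {i: mllm_conf[i] * sam_conf[i] for i in candidates}
    return max(joint, key=joint.get)


if __name__ == "__main__":
    # Toy video: six 4x4 frames with scene changes at frames 2 and 5.
    video = np.zeros((6, 4, 4))
    video[2:] += 10
    video[5:] += 50
    candidates = sample_candidates(video, k=2)
    keyframe = select_keyframe(
        candidates,
        mllm_conf={2: 0.9, 5: 0.6},  # placeholder scores, not model output
        sam_conf={2: 0.8, 5: 0.7},
    )
    print(candidates, keyframe)
```

In a full pipeline, the object segmented on the selected keyframe would be stored in a memory bank and handed to a tracker for propagation to the remaining frames; that part is omitted here because it depends on the specific pre-trained models used.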