CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos
Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
TL;DR
CoT-RVS tackles Reasoning Video Object Segmentation by using zero-shot Chain-of-Thought prompting of pretrained multimodal LLMs (MLLMs) to perform temporal-semantic reasoning for keyframe selection. It introduces a modular three-agent pipeline (keyframe selector, reasoning segmentation model, video processor) that runs in offline and online modes without fine-tuning and produces instance-level mask sequences that respect temporal context. Evaluations on MeViS, Refer-DAVIS-17, ReVOS, and ReasonVOS show significant gains over state-of-the-art methods, especially for temporally sensitive queries; the online extension updates the target object as streaming video arrives. The framework works with both open- and closed-source MLLMs and shows competitive robustness and modularity, opening new directions for zero-shot reasoning in dynamic visual environments.
Abstract
Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex and implicit text query. While existing works fine-tune Multimodal Large Language Models (MLLMs) for the task, they still fail on video inputs given complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework that employs the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges through temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses for each object a corresponding keyframe, among all frames, in which the object can be observed effortlessly (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and it can be applied to Reasoning Video Instance Segmentation. Because the framework is training-free, it also extends to online video streams, where the CoT is used at test time to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
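The offline flow described above (pick candidate objects and a keyframe per object, segment each keyframe, then extend the mask to the whole video) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Candidate`, `segment_video`, `select_keyframes`, `segment_frame`, and `propagate` are hypothetical, the three agents are passed in as plain callables, and frames and masks are stood in for by strings so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Frame = str  # stand-in for an image; a real system would use pixel arrays
Mask = str   # stand-in for a binary mask


@dataclass
class Candidate:
    """One object that the (hypothetical) keyframe selector matched to the query."""
    name: str
    keyframe: int  # index of the frame where the object is easiest to observe


def segment_video(
    frames: Sequence[Frame],
    query: str,
    select_keyframes: Callable[[Sequence[Frame], str], List[Candidate]],
    segment_frame: Callable[[Frame, str], Mask],
    propagate: Callable[[Sequence[Frame], int, Mask], List[Mask]],
) -> Dict[str, List[Mask]]:
    """Return one mask sequence per candidate object (instance-level output)."""
    results: Dict[str, List[Mask]] = {}
    for cand in select_keyframes(frames, query):
        if not 0 <= cand.keyframe < len(frames):
            continue  # skip keyframe indices that fall outside the video
        # Assumption: the object name doubles as the per-keyframe segmentation prompt.
        key_mask = segment_frame(frames[cand.keyframe], cand.name)
        results[cand.name] = propagate(frames, cand.keyframe, key_mask)
    return results


# Toy agents, for illustration only. A real selector would prompt an MLLM.
def toy_selector(frames: Sequence[Frame], query: str) -> List[Candidate]:
    for i, frame in enumerate(frames):
        if "dog" in frame:
            return [Candidate(name="dog", keyframe=i)]
    return []


def toy_segmenter(frame: Frame, name: str) -> Mask:
    return f"mask({name})"


def toy_processor(frames: Sequence[Frame], key_idx: int, key_mask: Mask) -> List[Mask]:
    return [f"{key_mask}@{t}" for t in range(len(frames))]


frames = ["empty street", "dog enters", "dog runs"]
masks = segment_video(
    frames, "the animal that runs", toy_selector, toy_segmenter, toy_processor
)
```

Keeping the three agents as interchangeable callables mirrors the modularity claimed above: any of them could be swapped, for example an open-source MLLM for a closed-source one, without touching the loop.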
