CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang

TL;DR

CoT-RVS tackles Reasoning Video Object Segmentation by leveraging zero-shot Chain-of-Thought prompts from pretrained multimodal LLMs to perform temporal-semantic reasoning for keyframe selection. It introduces a modular three-agent pipeline (keyframe selector, reasoning segmentation model, video processor) that operates in offline and online modes without fine-tuning, enabling instance-level mask sequences that respect temporal context. Evaluations on MeViS, Refer-DAVIS-17, ReVOS, and ReasonVOS show significant gains over state-of-the-art methods, especially for temporally sensitive queries, with an online extension that updates targets during streaming video. The framework is flexible across open and closed-source MLLMs and demonstrates competitive robustness and modularity, opening new directions for zero-shot reasoning in dynamic visual environments.

Abstract

Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLMs) for the task, they still fail on video inputs when given complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework that employs the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges through temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and for each object chooses, among all frames, a corresponding keyframe in which the object can be observed effortlessly (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and it can also be applied to Reasoning Video Instance Segmentation. Because the framework is training-free, it can further be extended to process online video streams, where the CoT is used at test time to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
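
The data flow described above (select a keyframe per object, segment that keyframe, then propagate the mask through the video) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the names `TargetObject`, `segment_video`, and the three `toy_*` functions are hypothetical, frames and masks are plain strings standing in for images, and the toy agents replace the MLLM, the reasoning segmentation model, and the video processor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Frame = str  # assumption: a string stands in for an image array
Mask = str   # assumption: a string stands in for a binary mask


@dataclass
class TargetObject:
    """One candidate object (hypothetical structure for illustration)."""
    keyframe: int     # index of the frame where the object is easiest to observe
    description: str  # description of the object within that frame


def segment_video(
    frames: Sequence[Frame],
    query: str,
    select_keyframes: Callable[[Sequence[Frame], str], List[TargetObject]],
    segment_keyframe: Callable[[Frame, str], Mask],
    propagate: Callable[[Sequence[Frame], int, Mask], List[Mask]],
) -> Dict[int, List[Mask]]:
    """Return one mask sequence per object instance."""
    results: Dict[int, List[Mask]] = {}
    for obj_id, target in enumerate(select_keyframes(frames, query)):
        if not 0 <= target.keyframe < len(frames):
            raise ValueError(f"keyframe {target.keyframe} is out of range")
        # Segment the object once, on its own keyframe.
        key_mask = segment_keyframe(frames[target.keyframe], target.description)
        # Spread that single mask over the whole video.
        results[obj_id] = propagate(frames, target.keyframe, key_mask)
    return results


# Toy stand-ins for the three agents (assumptions, not real models).
def toy_selector(frames, query):
    # Pretend one matching object was found, best seen in the last frame.
    return [TargetObject(len(frames) - 1, f"object matching: {query}")]


def toy_segmenter(frame, description):
    return f"mask({frame})"


def toy_propagator(frames, keyframe, key_mask):
    # Keep the keyframe mask and label every other frame as tracked.
    return [key_mask if i == keyframe else f"tracked({f})"
            for i, f in enumerate(frames)]


masks = segment_video(["f0", "f1", "f2"], "which cat stays still?",
                      toy_selector, toy_segmenter, toy_propagator)
print(masks)
```

Passing the agents in as callables mirrors the modularity claimed in the TL;DR: any of the three components could be swapped without touching the loop.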

Paper Structure

This paper contains 24 sections, 7 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: CoT-RVS is a novel framework for Reasoning Video Segmentation that utilizes the zero-shot Chain-of-Thought (CoT) capability of pretrained multimodal Large Language Models. Given the temporally sensitive query "Which player makes a three-point shot in this basketball game?", which requires both spatial and temporal reasoning, CoT-RVS correctly segments the fast-moving player right before, during, and after his three-point shot.
  • Figure 2: Keyframe selection is particularly challenging for time-sensitive queries. Given a video in which two cats play with a cat teaser at the beginning and the white cat then stays static in the second half, existing methods, e.g., VISA [yan2024visa], fail to find a proper keyframe and directly use the user prompt for reasoning segmentation. In contrast, CoT-RVS extracts the temporal-semantic correlation from the input video and outputs a more reasonable keyframe with a detailed description of the target object.
  • Figure 3: CoT-RVS for Reasoning VIS where Reasoning VOS is a special case.
  • Figure 4: Illustration of the CoT Process. The MLLM agent is prompted to synthesize chain-of-thought questions and answers. This CoT process deepens the temporal and spatial understanding of the keyframe candidates. Based on the CoT result, the MLLM agent outputs an instance-level object list, which contains the object-specific keyframe and the description of each object within that frame. Refer to the appendix for the detailed prompt and output examples.
  • Figure 5: CoT-RVS for Online Reasoning Video Object Segmentation.
  • ...and 10 more figures
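
The instance-level object list mentioned in the Figure 4 caption pairs each object with a keyframe and a description. A minimal sketch of how such a list might be read from an MLLM reply is given below. The reply format is an assumption made for illustration (free-form reasoning followed by a JSON array with `keyframe` and `description` fields); the actual prompt and output format are in the paper's appendix and are not reproduced here. The function name `parse_object_list` is hypothetical.

```python
import json
from typing import List, Tuple


def parse_object_list(raw: str, num_frames: int) -> List[Tuple[int, str]]:
    """Extract (keyframe, description) pairs from a reply, dropping invalid entries."""
    # Assumption: the object list is the outermost JSON array in the reply.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    objects: List[Tuple[int, str]] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        idx, desc = item.get("keyframe"), item.get("description")
        valid_idx = (isinstance(idx, int) and not isinstance(idx, bool)
                     and 0 <= idx < num_frames)
        if valid_idx and isinstance(desc, str) and desc.strip():
            objects.append((idx, desc.strip()))
    return objects


reply = (
    "Reasoning: the white cat stops moving in the second half.\n"
    '[{"keyframe": 4, "description": "white cat lying on the rug"}, '
    '{"keyframe": 99, "description": "out of range"}]'
)
object_list = parse_object_list(reply, num_frames=10)
print(object_list)
```

Validating the keyframe index against the frame count keeps a malformed reply from reaching the segmentation step; an empty list can then be treated as "no matching object found".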