ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu

TL;DR

ReVSeg addresses reasoning-centric video object segmentation by decomposing the reasoning process into three native VLM primitives: semantics interpretation, temporal grounding, and spatial grounding. It replaces single-step latent predictions with a two-round reasoning chain executed inside a single vision-language model, and optimizes the chain with Group Relative Policy Optimization using reasoning-aligned rewards. The method achieves state-of-the-art results on ReasonVOS, ReVOS, and several referring VOS benchmarks, while providing auditable, stepwise reasoning traces. This explicit-chain, outcome-driven framework offers a general paradigm for improving reasoning in video understanding models.

Abstract

Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision-language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- that align with the model's pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. The project page is available at https://clementine24.github.io/ReVSeg/.
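The TL;DR names Group Relative Policy Optimization (GRPO) as the reinforcement learning method. As a rough illustration of the outcome-driven signal, the sketch below scores a group of rollouts sampled for one query and normalizes each reward against the group mean and standard deviation, which is the core of GRPO's advantage estimate. The function names, the unweighted sum of format, temporal, and spatial rewards, and the use of the population standard deviation are assumptions made for illustration; they are not confirmed by this summary.

```python
from statistics import mean, pstdev
from typing import List


def total_reward(r_format: float, r_temporal: float, r_spatial: float) -> float:
    """Combine the three reward terms for one rollout.

    Assumption: an unweighted sum. The actual weighting used by the
    authors is not stated in this summary.
    """
    return r_format + r_temporal + r_spatial


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same query.

    Each advantage is (reward - group mean) / (group std + eps), so
    rollouts are ranked against their siblings rather than against a
    learned value function. Population std is an assumption here;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for a single (video, query) pair.
    group = [
        total_reward(1.0, 1.0, 0.8),
        total_reward(1.0, 0.0, 0.1),
        total_reward(1.0, 1.0, 0.4),
        total_reward(0.0, 0.0, 0.0),
    ]
    for reward, advantage in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Rollouts that beat the group average receive a positive advantage and are reinforced; when every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.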

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (Left) Through an explicit reasoning chain, our ReVSeg tackles reasoning-focused video object segmentation and accurately grounds objects referenced by complex, abstract real-world queries. (Right) While the base model and its RL variant struggle on the task, our method achieves strong performance, with RL post-training yielding a further substantial boost. We report the $\mathcal{J}\&\mathcal{F}$ metric on Ref-DAVIS17 (in-domain) and ReasonVOS (out-of-domain) datasets in the chart.
  • Figure 2: Overview of ReVSeg. The model runs a two-round reasoning chain over the input video and query. Round one analyzes the scene and selects an informative keyframe with a concise object description. Round two grounds the target on that keyframe by predicting a bounding box. The keyframe-bbox pair conditions a video tracker to produce the full segmentation sequence. A reward manager provides concise signals to post-train the VLM via reinforcement learning, improving keyframe selection, grounding accuracy, and overall robustness.
  • Figure 3: Qualitative cases of ReVSeg on ReasonVOS [bai2024one]. The frame highlighted in red indicates the selected keyframe. The green bounding box within the enlarged keyframe on the right side represents the grounding result. Zoom in to view visual details.
  • Figure 4: Training curves of ReVSeg. (a) Format reward $r_f$ rapidly converges to a full score and remains saturated. (b) Temporal reward $r_t$ and (c) spatial reward $r_s$ increase steadily with training. (d) Response length remains stable overall without collapse. (e) Total reward $r$ rises consistently over time. (f) Average number of rollout turns quickly converges to 2.
  • Figure 5: Additional qualitative cases of ReVSeg. The frame highlighted in red indicates the selected keyframe. The green bounding box within the enlarged keyframe on the right side represents the grounding result. Zoom in to view visual details.
  • ...and 1 more figures
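
The two-round chain described in the Figure 2 caption can be sketched as plain control flow. This is a minimal sketch under stated assumptions, not the authors' implementation: `KeyframeChoice`, `run_reasoning_chain`, and the three callables (`select_keyframe`, `ground_object`, `track`) are hypothetical names standing in for the VLM's first round, its second round, and the video tracker.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")
BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed convention


@dataclass
class KeyframeChoice:
    """Round-one output: which frame to ground on and what to look for."""
    keyframe_index: int
    description: str


def run_reasoning_chain(
    frames: Sequence[Frame],
    query: str,
    select_keyframe: Callable[[Sequence[Frame], str], KeyframeChoice],
    ground_object: Callable[[Frame, str], BBox],
    track: Callable[[Sequence[Frame], int, BBox], List[Mask]],
) -> List[Mask]:
    """Run round one, round two, then tracking (all callables are hypothetical)."""
    # Round one: interpret the query and pick an informative keyframe.
    choice = select_keyframe(frames, query)
    if not 0 <= choice.keyframe_index < len(frames):
        raise ValueError(f"keyframe index {choice.keyframe_index} is out of range")

    # Round two: ground the described object on the chosen keyframe.
    bbox = ground_object(frames[choice.keyframe_index], choice.description)

    # The keyframe-bbox pair conditions the tracker, which returns one mask per frame.
    return track(frames, choice.keyframe_index, bbox)
```

Keeping the three stages as separate callables mirrors the auditable, stepwise trace the summary emphasizes: the keyframe index, the object description, and the bounding box are each inspectable intermediate outputs.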