VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li

TL;DR

VideoSeg-R1 tackles Referring Video Object Segmentation by introducing reinforcement learning to enable explicit reasoning over long video sequences. It decouples the task into hierarchical text-guided frame sampling, GRPO-enhanced multimodal reasoning that yields spatial cues, and a segmentation-propagation stage using SAM2 and XMem, with a task-difficulty-aware mechanism to adapt reasoning length. Key contributions include the first RL-based framework for reasoning-aware RVOS, a Hierarchical Text-guided Frame Sampler, a soft length penalty tied to a learned task difficulty, and a multi-object matching strategy via Hungarian assignment, all contributing to state-of-the-art results on multiple benchmarks. The approach demonstrates strong generalization to challenging, out-of-distribution queries and sets a new direction for interpretable, resource-aware video understanding, albeit with notable computational costs.
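The multi-object matching step mentioned above can be illustrated with a small sketch. It pairs predicted boxes with ground-truth boxes so that the total IoU is maximal, which is the same optimum a Hungarian solver would return. Everything here is an assumption for illustration: the function names `box_iou` and `match_boxes`, the `(x1, y1, x2, y2)` box format, the use of IoU as the matching score, and the normalisation by the larger set size are not taken from the paper. For brevity the sketch enumerates assignments by brute force instead of running the Hungarian algorithm, so it is only practical for a handful of objects.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2). Box format is an assumption."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(pred, gt):
    """Return (pairs, mean_iou) for the one-to-one matching with maximum total IoU.

    pairs is a list of (pred_index, gt_index). Unmatched boxes contribute an
    IoU of 0 because the total is divided by max(len(pred), len(gt)).
    Brute-force stand-in for a Hungarian solver (hypothetical helper).
    """
    if not pred or not gt:
        return [], 0.0
    iou = [[box_iou(p, g) for g in gt] for p in pred]
    if len(pred) <= len(gt):
        # Assign each prediction to a distinct ground-truth box.
        best = max(
            permutations(range(len(gt)), len(pred)),
            key=lambda cols: sum(iou[i][j] for i, j in enumerate(cols)),
        )
        pairs = list(enumerate(best))
    else:
        # More predictions than targets: assign each target to a distinct prediction.
        best = max(
            permutations(range(len(pred)), len(gt)),
            key=lambda rows: sum(iou[i][j] for j, i in enumerate(rows)),
        )
        pairs = sorted((i, j) for j, i in enumerate(best))
    total = sum(iou[i][j] for i, j in pairs)
    return pairs, total / max(len(pred), len(gt))
```

With more than a few objects, a polynomial-time solver such as SciPy's `linear_sum_assignment` on the negated IoU matrix would be the usual choice; the matching it returns has the same total IoU as the enumeration above.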

Abstract

Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
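To make the idea of difficulty-aware control of reasoning length concrete, here is a minimal sketch of one possible soft length penalty. It is not the paper's formula, which this summary does not give. The function names, the difficulty range of [0, 1], the token bounds, and the linear ramp are all hypothetical choices: harder queries get a larger token budget, and the penalty grows gradually once the reasoning chain exceeds that budget.

```python
def token_budget(difficulty, min_tokens=64, max_tokens=512):
    """Map a difficulty score in [0, 1] to a reasoning-length budget.

    The bounds are illustrative defaults, not values from the paper.
    """
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return min_tokens + d * (max_tokens - min_tokens)


def soft_length_penalty(num_tokens, difficulty, slack=128):
    """Return 0.0 within budget, then decrease linearly to -1.0 over `slack` tokens.

    `slack` is a hypothetical parameter controlling how soft the penalty is.
    """
    over = num_tokens - token_budget(difficulty)
    if over <= 0:
        return 0.0
    return -min(over / slack, 1.0)
```

Under these assumed defaults, a 300-token chain is not penalised for a query of difficulty 1.0 (budget 512) but receives the full penalty of -1.0 for a query of difficulty 0.0 (budget 64). In an RL setup such a term would typically be added to the task reward, so that long reasoning is discouraged only where the query does not call for it.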

Paper Structure

This paper contains 27 sections, 9 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: VideoSeg-R1 achieves state-of-the-art performance on both video and image benchmarks covering reasoning and referring segmentation tasks.
  • Figure 2: Our VideoSeg-R1 effectively segments and tracks in challenging scenarios, including: (a) objects in crowded scenes; (b) multiple objects with rapid motion; and (c) diverse targets appearing simultaneously.
  • Figure 3: Overview of VideoSeg-R1, which consists of the following three stages: (1) a hierarchical text-guided frame sampler to emulate human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem.
  • Figure A1: Prompt templates for Hierarchical Text-guided Frame Sampling. The coarse-grained temporal localization prompt guides the model to identify the key segment in the video, while the fine-grained frame localization prompt instructs the model to pinpoint the target frame via percentage estimation.
  • Figure D2: Design of the difficulty scoring scheme.
  • ...and 3 more figures