AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Woojeong Jin; Jaeho Lee; Heeseong Shin; Seungho Jang; Junhwan Heo; Seungryong Kim

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes and grounds the referred object within those frames, and a video segmentation model then propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 generates mask tracks that provide reliable perception over the full spatio-temporal extent of the video. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning the candidates, guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.
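
To make "temporal existence information" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each mask track records the frame indices where its object is present, and that the frames shown to the MLLM are drawn only from frames where some candidate exists. The names `MaskTrack`, `temporal_scope`, and `sample_frames`, and the even-spacing sampling rule, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskTrack:
    """Hypothetical stand-in for one candidate mask track."""
    track_id: int
    frames: frozenset  # frame indices where the object is present


def temporal_scope(tracks):
    """Return the sorted frame indices where at least one candidate exists."""
    present = set()
    for track in tracks:
        present |= track.frames
    return sorted(present)


def sample_frames(scope, budget):
    """Pick at most `budget` roughly evenly spaced frames from the scope.

    The sampling rule is an assumption made for illustration only.
    """
    if budget <= 0 or not scope:
        return []
    if len(scope) <= budget:
        return list(scope)
    if budget == 1:
        return [scope[len(scope) // 2]]
    step = (len(scope) - 1) / (budget - 1)
    return [scope[round(i * step)] for i in range(budget)]


# Two candidates that exist in disjoint parts of the video.
tracks = [
    MaskTrack(0, frozenset(range(0, 10))),
    MaskTrack(1, frozenset(range(40, 60))),
]
scope = temporal_scope(tracks)
keyframes = sample_frames(scope, budget=4)
```

In this toy case, frames 10 to 39 contain no candidate, so they are never sampled; the frame budget is spent only where object-level evidence exists.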

Paper Structure (39 sections, 2 equations, 18 figures, 7 tables, 3 algorithms)

Introduction
Related Work
Referring Video Object Segmentation.
MLLM-based Reasoning Video Object Segmentation.
Method
Problem Formulation and Overview
Candidate Mask Track Generation
Concept Extraction.
Mask Track Generation via SAM3.
Iterative Spatio-temporal Pruning
Candidate Pruning.
Temporal Scope Pruning.
Convergence.
Experiments
Datasets and Metrics.
...and 24 more sections

Figures (18)

Figure 1: Teaser. AgentRVOS is a training-free agentic pipeline built on the complementary strengths of SAM3 [carion2025sam3] and an MLLM [bai2025qwen3, openai2025gpt5]. The MLLM first uses SAM3 to generate candidate mask tracks, then iteratively prunes them through query-grounded reasoning over object-level evidence.
Figure 2: Complementary concept of SAM3 and MLLM. SAM3 [carion2025sam3] can precisely identify objects without missing a single frame, but struggles with complex queries. MLLMs [bai2025qwen3, openai2025gpt5, li2024llava], on the other hand, offer strong reasoning capabilities, but operate on sparse frames and struggle with non-salient objects. AgentRVOS combines the advantages of both by interleaving the two models in a complementary manner.
Figure 3: Overall pipeline. Given a video and a natural language query, our pipeline operates in two phases. In Candidate Mask Track Generation (Sec. \ref{sec:method_candidate}), the MLLM first analyzes the query to extract concepts, which SAM3 uses to produce temporally consistent candidate mask tracks; this process iterates to ensure sufficient coverage. In Iterative Spatio-temporal Pruning (Sec. \ref{sec:method_pruning}), the MLLM reasons over the candidate pool, classifying each candidate as Accepted, Rejected, or Uncertain, while progressively narrowing the spatio-temporal scope until convergence.
Figure 4: Qualitative results. AgentRVOS effectively resolves challenging scenarios such as multi-instance ambiguity and temporal reasoning, accurately segmenting the referred objects.
Figure 5: Qualitative results of iteration in Iterative Spatio-temporal Pruning. We illustrate how our iterative spatio-temporal pruning progressively narrows the relevant temporal window and eliminates irrelevant track candidates. Across iterations, the remaining candidates become fewer but more query-consistent, leading to the final selected track set.
...and 13 more figures
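
The pruning phase described in the Figure 3 caption can be sketched as a small loop. This is an illustrative reading under stated assumptions, not the paper's algorithm: `classify` is a hypothetical callable standing in for the MLLM's judgement, the iteration cap `max_iters` is invented, and convergence is taken to mean that no candidate is still Uncertain.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    UNCERTAIN = "uncertain"


def iterative_pruning(candidates, classify, max_iters=5):
    """Prune candidate tracks until none is Uncertain or max_iters is hit.

    candidates: dict mapping track_id -> set of frame indices where it exists
    classify:   hypothetical callable (track_id, scope) -> Verdict
    Returns (accepted_ids, unresolved_ids).
    """
    accepted = set()
    pending = dict(candidates)
    for _ in range(max_iters):
        if not pending:
            break  # converged: every candidate was accepted or rejected
        # Temporal scope: frames where any still-pending candidate exists.
        scope = sorted(set().union(*pending.values()))
        still_uncertain = {}
        for track_id, frames in pending.items():
            verdict = classify(track_id, scope)
            if verdict is Verdict.ACCEPTED:
                accepted.add(track_id)
            elif verdict is Verdict.UNCERTAIN:
                still_uncertain[track_id] = frames
            # Rejected candidates are dropped, which shrinks the next scope.
        pending = still_uncertain
    return accepted, set(pending)


def toy_classify(track_id, scope):
    """Toy judge: track 1 becomes decidable once the scope is small enough."""
    if track_id == 0:
        return Verdict.REJECTED
    if track_id == 1:
        return Verdict.ACCEPTED if len(scope) <= 4 else Verdict.UNCERTAIN
    return Verdict.UNCERTAIN


accepted, unresolved = iterative_pruning(
    {0: {0, 1, 2}, 1: {5, 6}, 2: {8, 9}}, toy_classify
)
```

In the toy run, rejecting track 0 in the first pass narrows the scope from seven frames to four, which lets the second pass accept track 1; track 2 stays unresolved when the iteration cap is reached.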