
VIRST: Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation

Jihwan Hong, Jaeyoung Do

Abstract

Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision-language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF) module, which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater (TDAU) to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning-oriented settings. The code and checkpoints are available at https://github.com/AIDASLab/VIRST.

Paper Structure

This paper contains 54 sections, 23 equations, 10 figures, 11 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Performance comparison with existing RVOS methods. VIRST achieves state-of-the-art results across all referring video object segmentation benchmarks, while maintaining competitive performance on referring and reasoning-based image segmentation tasks.
  • Figure 2: Overall architecture of VIRST. (a) VIRST utilizes a VLM to capture global video context and identify query-aligned targets. The Spatio-Temporal Fusion (STF) fuses features from the segmentation-aware vision encoder, while the Temporal Dynamic Anchor Updater (TDAU) provides local and long-range temporal cues through a dual-track memory design. (b) The ST-Fusion module includes an Initial ST-Fusion stage, where the [ST] tokens are fused with segmentation-aware video tokens prior to VLM processing, followed by the Second ST-Fusion stage that applies cross-attention between the temporally expanded [ST] tokens and the segmentation-aware video tokens. The resulting spatiotemporal prompts are sliced to produce frame-specific segmentation prompts.
  • Figure 3: Anchor selection scheme of TDAU. Given a video, TDAU selects anchor-frame candidates $\mathcal{A}$ and generates their segmentation masks using frame-wise segmentation prompts. For each non-anchor frame, the module retrieves the $\alpha$ temporally nearest anchor frames $\mathcal{I}^{(k)}_{\text{Anchor}}$. The anchor set is temporally updated over time as the video advances. This strategy maintains temporal locality while ensuring coverage of a broader temporal range (a small illustrative sketch of this retrieval step follows the figure list).
  • Figure 4: Qualitative results of VIRST. Across diverse video segmentation scenarios, VIRST generates high-quality masks despite strong distractors, reasoning-oriented queries, heavy occlusions, small objects, and multiple interacting instances—demonstrating robust spatiotemporal reasoning and effective integration of both global and local video context. Results are best viewed when zoomed in.
  • Figure 5: STF patch-wise attention visualization. We visualize the $8 \times 8$ patch-level attention from the STF before feeding it into the segmentation decoder. The attention maps consistently highlight key motion regions along the spatiotemporal dimension.
  • ...and 5 more figures
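
To make the anchor-retrieval step of Figure 3 concrete, the following is a minimal Python sketch of selecting, for each non-anchor frame, the $\alpha$ temporally nearest anchor frames. The function name `select_nearest_anchors`, its arguments, and the toy anchor set are illustrative assumptions introduced here, not the released VIRST implementation.

```python
def select_nearest_anchors(anchor_indices, num_frames, alpha=2):
    """For each non-anchor frame, pick the `alpha` temporally nearest
    anchor frames (illustrative re-creation of the retrieval step in
    Figure 3; names and defaults are assumptions, not the paper's code)."""
    anchors = sorted(anchor_indices)
    nearest = {}
    for t in range(num_frames):
        if t in anchors:
            continue  # anchor frames get masks directly from their own prompts
        # sort anchors by absolute temporal distance to frame t
        by_distance = sorted(anchors, key=lambda a: abs(a - t))
        nearest[t] = by_distance[:alpha]
    return nearest


# Toy usage: anchors at frames 0, 8, and 16 of a 20-frame clip.
# Frame 5 is paired with anchors [8, 0]; frame 18 with [16, 8].
print(select_nearest_anchors([0, 8, 16], num_frames=20, alpha=2))
```

Restricting each non-anchor frame to its α nearest anchors preserves temporal locality, while updating the anchor set as the video advances extends coverage to a broader temporal range, as described in the Figure 3 caption.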