The Devil is in Temporal Token: High Quality Video Reasoning Segmentation

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, Huchuan Lu

TL;DR

VRS-HQ tackles Video Reasoning Segmentation by introducing a Temporal Token Encoding scheme with frame-level <SEG> and temporal <TAK> tokens produced by a Multimodal LLM. Temporal Dynamic Aggregation fuses frame-level features into a cohesive temporal representation, guiding a Token-driven Keyframe Selection that leverages SAM2 for end-to-end keyframe segmentation and propagation via memory. The method achieves state-of-the-art results on ReVOS and RVOS benchmarks, with strong ablations confirming the efficacy of TDA and TKS, and demonstrates robust cross-dataset generalization including RIS and reasoning segmentation tasks. The approach enables end-to-end video reasoning segmentation with improved keyframe localization and high-quality mask propagation, and the authors provide code and model weights for reproducibility.

Abstract

Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level <SEG> and temporal-level <TAK> tokens that utilize MLLM's autoregressive learning to effectively capture both local and global information. Subsequently, we apply a similarity-based weighted fusion and frame selection strategy, then utilize SAM2 to perform keyframe segmentation and propagation. To enhance keyframe localization accuracy, the TKS filters keyframes based on SAM2's occlusion scores during inference. VRS-HQ achieves state-of-the-art performance on ReVOS, surpassing VISA by 5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight the strong temporal reasoning and segmentation capabilities of our method. Code and model weights will be released at VRS-HQ.
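
The similarity-based weighted fusion mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the frame-level <SEG> embeddings and the <TAK> embedding are plain vectors of equal length, that the fusion weights come from a softmax over cosine similarities, and that the fused token is a simple weighted sum. The function names (`cosine`, `softmax`, `fuse_tokens`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero vectors


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_tokens(seg_tokens, tak_token):
    """Fuse frame-level <SEG> embeddings into one temporal embedding.

    Assumption (not confirmed by the source): weights are a softmax over
    the cosine similarity of each <SEG> embedding to the <TAK> embedding.
    """
    sims = [cosine(s, tak_token) for s in seg_tokens]
    weights = softmax(sims)
    dim = len(tak_token)
    fused = [sum(w * s[d] for w, s in zip(weights, seg_tokens)) for d in range(dim)]
    return fused, weights, sims


# Toy example: three frames with 2-D embeddings.
seg_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tak_token = [1.0, 0.0]
fused, weights, sims = fuse_tokens(seg_tokens, tak_token)
```

In this toy example the first frame is the most similar to the <TAK> embedding, so it receives the largest weight and contributes most to the fused token.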

Paper Structure (36 sections, 6 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Comparison with previous VRS approaches. (a) Previous methods utilize a single <SEG> token for keyframe-based segmentation, depending heavily on external models for keyframe detection and mask propagation. This reliance can hinder accurate keyframe localization and prevent end-to-end inference. (b) In contrast, VRS-HQ introduces frame-level <SEG> and a temporal <TAK> token for dynamic aggregation. The aggregated <TAK> token is then used for both keyframe selection and mask generation within SAM2. This enables single-stage inference with precise keyframe selection and high-quality segmentation. (c) VRS-HQ achieves state-of-the-art performance on various image and video datasets across reasoning and referring segmentation.
  • Figure 2: (a) VRS-HQ architecture. VRS-HQ incorporates a Multimodal Large Language Model for Temporal Token Encoding (<SEG> and <TAK> tokens, §3.1), Temporal Dynamic Aggregation, Token-driven Keyframe Selection, and Mask Decoding and Propagation. (b) Temporal Dynamic Aggregation (TDA) merges frame-level <SEG> tokens into a temporal <TAK> token using a weighted fusion based on cosine similarity (§3.2). (c) Token-driven Keyframe Selection (TKS). During training, the frame whose <SEG> token is closest to the <TAK> token is selected as the keyframe. During inference, keyframe selection is refined using SAM2's occlusion scores and token similarity scores (§3.3). (d) Mask Decoding and Propagation (MDP). The <TAK> token provides a sparse embedding for SAM2, which generates a keyframe mask and propagates it to the other frames via a memory mechanism (§3.4).
  • Figure 3: Segmentation map comparison of VISA and VRS-HQ on the ReVOS benchmark. Results across three scenarios demonstrate that VRS-HQ excels in reasoning about complex spatial and temporal relationships, delivering enhanced segmentation performance.
  • Figure 4: Visualization of feature maps. From top to bottom: (a) Ground truth masks. (b) Keyframe mask embeddings generated by the <TAK> token before TDA. (c) Keyframe mask embeddings generated by the <TAK> token after TDA.
  • Figure 5: Details of SAM2 for mask decoding and propagation. All video frames are fed into the image encoder for feature extraction. The feature embeddings of the keyframe interact with $h'_{tak}$ in the mask decoder to generate the keyframe mask, which is then propagated to the remaining video frames via the memory mechanism.
  • ...and 6 more figures
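
The two keyframe selection modes described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's procedure: the caption only states that training picks the frame whose <SEG> token is closest to the <TAK> token and that inference additionally uses SAM2's occlusion scores. How the two signals are combined is an assumption here (shortlist the `top_k` most similar frames, then keep the one with the highest score), as is the convention that a higher score means the target is more likely visible. The function names and the `top_k` parameter are hypothetical, and the scores are passed in as plain lists rather than obtained from any SAM2 API.

```python
def select_keyframe_train(sims):
    """Training-time rule: index of the frame most similar to the <TAK> token."""
    return max(range(len(sims)), key=lambda i: sims[i])


def select_keyframe_infer(sims, occlusion_scores, top_k=3):
    """Inference-time sketch: shortlist by similarity, then filter by score.

    Assumptions (not confirmed by the source): candidates are the top_k
    frames by token similarity, and a higher occlusion score means the
    target is more likely visible in that frame.
    """
    if len(sims) != len(occlusion_scores):
        raise ValueError("sims and occlusion_scores must have the same length")
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    candidates = order[:top_k]
    return max(candidates, key=lambda i: occlusion_scores[i])


# Toy example with four frames.
sims = [0.2, 0.9, 0.8, 0.1]
occlusion_scores = [5.0, -1.0, 2.0, 9.0]

train_idx = select_keyframe_train(sims)
infer_idx = select_keyframe_infer(sims, occlusion_scores, top_k=2)
```

Here the training rule picks frame 1 on similarity alone, while the inference sketch prefers frame 2: it is nearly as similar to the <TAK> token, and its score suggests the target is visible there.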