LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation

Cilin Yan, Jingyun Wang, Guoliang Kang

TL;DR

LTCA addresses RVOS by modeling long-range temporal context through a dual-path attention scheme: stacked sparse local attentions via dilated window and random attention to balance locality and globality, plus a global-query pathway to directly encode long-range context with linear complexity in the number of frames. The framework uses a Frame Object Extractor to obtain per-frame object embeddings, a linguistically informed global query set, and a Mask Generator to produce frame-wise segmentation masks, enabling direct cross-frame mask generation without a video-object generator. Empirical results on MeViS, Ref-YouTube-VOS, Ref-DAVIS17, and A2D-Sentences show state-of-the-art performance, with pronounced gains on motion-rich data like MeViS. The approach offers a scalable, robust solution for long-video RVOS and provides a foundation for future long-video multi-modal reasoning tasks.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and their computational complexity grows significantly with video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. Firstly, we stack sparse local attentions to balance locality and globality. We design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance globality. Secondly, we design a global query to interact with all the other queries to directly encode the global context information. Experiments show our method achieves new state-of-the-art results on four referring video segmentation benchmarks. Notably, our method shows improvements of 11.3% and 8.3% on the MeViS val$^u$ and val sets, respectively.
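The three attention patterns named in the abstract (dilated window, random, and global query) can be pictured as a sparse boolean mask over queries and keys. The following is only a minimal sketch under simplifying assumptions, not the authors' implementation: it assumes one object query per frame, and the function names `ltca_style_mask` and `attended_pairs` as well as the parameters `window`, `dilation`, `num_random`, `num_global` and `seed` are hypothetical names chosen for illustration.

```python
import random


def ltca_style_mask(num_frames, window=3, dilation=2, num_random=1,
                    num_global=1, seed=0):
    """Return mask[i][j] == True if query i may attend to key j.

    Assumed token layout (illustration only): indices 0..num_frames-1 are
    per-frame object queries, one per frame, followed by num_global
    global queries.
    """
    rng = random.Random(seed)
    n = num_frames + num_global
    mask = [[False] * n for _ in range(n)]
    half = window // 2

    for i in range(num_frames):
        # Dilated window: keys at frame offsets -half*d, ..., 0, ..., +half*d.
        for k in range(-half, half + 1):
            j = i + k * dilation
            if 0 <= j < num_frames:
                mask[i][j] = True
        # Random attention: a few extra keys drawn from all remaining frames.
        candidates = [j for j in range(num_frames) if not mask[i][j]]
        for j in rng.sample(candidates, min(num_random, len(candidates))):
            mask[i][j] = True

    # Global queries attend to every token and are visible to every token.
    for g in range(num_frames, n):
        for j in range(n):
            mask[g][j] = True
            mask[j][g] = True
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = ltca_style_mask(8)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
    print(attended_pairs(mask), "allowed pairs out of", len(mask) ** 2)
```

In this sketch each per-frame query attends to at most `window + num_random + num_global` keys regardless of video length, so the number of allowed pairs grows linearly with the number of frames, whereas full attention grows quadratically. Stacking several such layers lets information propagate beyond a single dilated window.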

Paper Structure

This paper contains 18 sections, 12 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Comparison of current query-based referring video object segmentation (RVOS) pipelines. (a) Modeling the frame feature sequence using full attention, (b) modeling the frame object sequence using shifted window attention, (c) modeling the frame object sequence using LTCA.
  • Figure 2: Overview of the architecture of our method. First, all input frames are fed to a transformer-based extractor to generate object-centric embeddings $\{E_{f}^t\}_{t=1}^T$. We then flatten $\{E_{f}^t\}_{t=1}^T$ into frame object queries $Q_o$ and concatenate them with a set of learnable global queries $Q_g$, which are initialized with the text embedding of the given linguistic expression. The concatenated queries are fed to an LTCA module, which performs efficient information interaction among the frame object queries $Q_o$ and the linguistic-aware global queries $Q_g$. The output global queries $\widetilde{Q}_g$ are used to generate the segmentation mask sequences of the target objects.
  • Figure 3: Visualization of different attention patterns. For ease of visualization, we set $N_1=2$, $N_2=2$, $w = 3$, $d = 2$ and $r = 1$.
  • Figure 4: Visualization results on MeViS, comparing LMPM with our method. The previous method tends to segment irrelevant objects because it fails to strike a good balance between locality and globality.
  • Figure 5: Visualization results on Ref-YouTube-VOS, comparing MUTR with our method.
  • ...and 1 more figures