Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

Shuting He, Henghui Ding

TL;DR

This work tackles referring video segmentation by decoupling static grounding from motion reasoning. It introduces an expression-decoupling module to separate static cues $F_s$ from motion cues $F_m$, and a Hierarchical Motion Perception (HMP) module to capture multi-scale temporal information via object-token trajectories and progressive merging. An object-wise contrastive learning framework with a memory bank improves discrimination among visually similar objects with different motions, further strengthening temporal understanding. The approach achieves state-of-the-art results across five datasets, with a notable $9.2\%$ improvement in $\mathcal{J\&F}$ on MeViS.

Abstract

Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion cues. Previous works treat a sentence as a whole and directly perform identification at the video level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot adequately capture motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. First, we introduce an expression-decoupling module so that static cues and motion cues each play their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Second, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable $\textbf{9.2\%}$ $\mathcal{J\&F}$ improvement on the challenging $\textbf{MeViS}$ dataset. Code is available at https://github.com/heshuting555/DsHmp.
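To make the contrastive-learning idea concrete, the following is a minimal sketch of an InfoNCE-style loss whose negatives are drawn from a memory bank. It is an illustration under stated assumptions, not the authors' implementation: the names `MemoryBank`, `info_nce`, and `cosine`, the FIFO eviction policy, and the temperature value are all hypothetical, and the actual loss used in DsHmp may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is close to the positive
    and far from every negative. The temperature is an assumed value."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

class MemoryBank:
    """Hypothetical fixed-size FIFO store of (object_id, feature) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, object_id, feature):
        self.items.append((object_id, feature))
        if len(self.items) > self.capacity:
            self.items.pop(0)  # evict the oldest entry

    def negatives_for(self, object_id):
        # Features of all stored objects other than the given one.
        return [feat for oid, feat in self.items if oid != object_id]

# Usage: two stored objects with different motion features.
bank = MemoryBank(capacity=4)
bank.push("bird_flying", [1.0, 0.0])
bank.push("bird_standing", [0.0, 1.0])
loss = info_nce([1.0, 0.0], [0.9, 0.1], bank.negatives_for("bird_flying"))
```

The bank lets negatives come from objects seen in earlier batches, which is one common reason to pair a contrastive loss with a memory bank.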

Paper Structure (14 sections, 9 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Previous works treat a sentence as a whole and perform referring understanding at the video level. However, image-level features struggle to understand motion cues, and static cues can sometimes disrupt temporal perception by overshadowing motion cues. We introduce a decoupling of static and motion perception, with a particular focus on enhancing temporal understanding.
  • Figure 2: Overview of the proposed approach, named DsHmp. We decouple referring video segmentation into image-level static perception and temporal-level motion perception. We first employ Mask2Former to segment the possible objects according to static cues $F_s$. Then, based on the object tokens $\mathcal{O}$ generated by Mask2Former, hierarchical motion perception is employed to gradually comprehend temporal motions from short-term to long-term. Next, we employ a Motion Decoder to identify the target object according to motion cues $F_m$ and produce video tokens $\mathcal{V}$, which are used for mask predictions. Contrastive learning is applied to the video tokens to help the model differentiate visually similar objects with distinct motion patterns.
  • Figure 3: Architecture of the proposed Hierarchical Motion Perception (HMP). The HMP module processes both short-term and long-term motions, enabling the capture of motion patterns spanning various frame intervals.
  • Figure 4: Visualization of features learned w/o CL (left) and w/ CL (right). Features are colored according to class labels. As seen, the proposed CL brings a well-structured video token feature space.
  • Figure 5: Visualization results for complex, motion-focused language descriptions on MeViS. Orange masks represent positive segmentation results and pink masks denote the negatives. Our DsHmp can capture temporal information effectively across varying timescales.
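
The short-term to long-term progression described in the captions for Figures 2 and 3 can be illustrated with a small sketch. This is only a toy stand-in under stated assumptions: it merges adjacent per-frame object tokens by plain averaging, whereas the actual HMP module presumably uses learned operations; the function names `merge_adjacent` and `build_hierarchy` are hypothetical.

```python
def merge_adjacent(tokens):
    """Merge neighbouring tokens pairwise by element-wise averaging.
    Averaging is an assumed placeholder for a learned merging step."""
    merged = []
    for i in range(0, len(tokens), 2):
        group = tokens[i:i + 2]  # last group has one token if length is odd
        merged.append([sum(vals) / len(group) for vals in zip(*group)])
    return merged

def build_hierarchy(tokens):
    """Return a list of levels, from per-frame tokens (short-term)
    down to a single token summarising the whole clip (long-term)."""
    levels = [tokens]
    while len(levels[-1]) > 1:
        levels.append(merge_adjacent(levels[-1]))
    return levels

# Usage: one object's trajectory over four frames, one feature per frame.
trajectory = [[0.0], [2.0], [4.0], [6.0]]
levels = build_hierarchy(trajectory)
# levels[0] holds 4 per-frame tokens, levels[1] holds 2, levels[2] holds 1.
```

Each level halves the temporal resolution, so later levels summarise motion over longer frame intervals while earlier levels retain frame-to-frame detail.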