Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

Chen Liang, Yu Wu, Tianfei Zhou, Wenguan Wang, Zongxin Yang, Yunchao Wei, Yi Yang

TL;DR

This work tackles Referring Video Object Segmentation (RVOS) by shifting from bottom-up, grid-level grounding to a top-down, object-centric strategy. It introduces a two-stage framework: (i) exhaustive Object Tracklet Construction, where masks detected on sampled frames are propagated through the video to form a comprehensive set of tracklets, which is then pruned with Tracklet-NMS, and (ii) Tracklet-Language Grounding, where a Transformer-based module jointly models relations among object tracklets and cross-modal interactions to ground the referring expression on the tracklets. Key contributions include the explicit use of object-level cues via tracklets, the Tracklet-NMS mechanism to reduce redundancy, and the Transformer-based grounding module for cross-modal reasoning. The approach ranked first in the CVPR2021 Referring Youtube-VOS challenge, with ablations supporting the contribution of each component.

Abstract

Referring video object segmentation (RVOS) aims to segment video objects under the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues, easily leading to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranked first in the CVPR2021 Referring Youtube-VOS challenge.
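
The Tracklet-NMS pruning step named in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each tracklet is a boolean mask stack of shape (T, H, W) with one confidence score, that overlap is measured as volumetric IoU over the whole video, and that pruning is greedy. The function names `tracklet_iou` and `tracklet_nms` and the 0.5 threshold are hypothetical choices made for illustration.

```python
import numpy as np


def tracklet_iou(a, b):
    """Volumetric IoU of two boolean mask stacks of shape (T, H, W).

    Assumption: overlap is pooled over all frames; the paper may define it differently.
    """
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def tracklet_nms(tracklets, scores, iou_threshold=0.5):
    """Greedy NMS: keep high-scoring tracklets, drop ones overlapping a kept tracklet."""
    # Visit tracklets from highest to lowest score.
    order = sorted(range(len(tracklets)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep tracklet i only if it does not overlap too much with any kept one.
        if all(tracklet_iou(tracklets[i], tracklets[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Toy example: three tracklets over 2 frames of size 4x4.
a = np.zeros((2, 4, 4), dtype=bool)
a[:, 0:2, 0:2] = True  # 4 pixels per frame
b = np.zeros((2, 4, 4), dtype=bool)
b[:, 0:2, 0:3] = True  # 6 pixels per frame, contains a (IoU = 2/3)
c = np.zeros((2, 4, 4), dtype=bool)
c[:, 2:4, 2:4] = True  # disjoint from a and b

kept = tracklet_nms([a, b, c], scores=[0.9, 0.8, 0.7])
print(kept)  # [0, 2]
```

In this toy case the second tracklet is suppressed because it largely duplicates the higher-scoring first one, while the disjoint third tracklet survives. The surviving tracklets would then be the candidates passed to the grounding stage.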

Paper Structure

This paper contains 4 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An illustration of our motivation. Previous bottom-up methods (a) perform cross-modal interaction at the grid level and fail to capture the crucial object-level relations that the top-down approach (b) models.
  • Figure 2: Pipeline of our proposed method, which contains two major stages, i.e., object tracklet generation (left column) and tracklet-language grounding (right column).
  • Figure 3: Representative visual results on the RVOS-D test-challenge set. Each referent and its corresponding textual description are highlighted in the same color.