Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation
Chen Liang, Yu Wu, Tianfei Zhou, Wenguan Wang, Zongxin Yang, Yunchao Wei, Yi Yang
TL;DR
This work tackles Referring Video Object Segmentation (RVOS) by shifting from bottom-up, grid-level grounding to a top-down, object-centric strategy. It introduces a two-stage framework: (i) exhaustive Object Tracklet Construction, where masks detected in sampled frames are propagated through the video to form a comprehensive set of tracklets, which is then pruned with Tracklet-NMS, and (ii) Tracklet-Language Grounding, where a Transformer-based module jointly models intra-object relations and cross-modal interactions to ground the referring expression on the tracklets. Key contributions include the explicit use of object-level cues via tracklets, the Tracklet-NMS mechanism to reduce redundancy, and a Transformer-based grounding module for cross-modal reasoning. The approach achieves state-of-the-art results on the Referring Youtube-VOS challenge, with ablations validating the importance of each component, and demonstrates that object-centric reasoning yields robust RVOS in challenging scenes.
Abstract
Referring video object segmentation (RVOS) aims to segment video objects under the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues, which easily leads to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected in several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranked first in the CVPR 2021 Referring Youtube-VOS challenge.
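Because masks propagated from several sampled frames can land on the same object, the tracklet set is likely to contain near-duplicates, which is what the Tracklet-NMS step in the TL;DR is meant to prune. The following is a minimal sketch of one way such pruning could work, not the authors' implementation. The names `mask_iou`, `tracklet_iou` and `tracklet_nms`, the mean per-frame IoU as the tracklet overlap measure, the per-tracklet confidence scores, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

Pixel = Tuple[int, int]       # (row, col) coordinate
Mask = FrozenSet[Pixel]       # pixels covered by the object in one frame
Tracklet = Sequence[Mask]     # one mask per frame; all tracklets share a length


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two single-frame masks."""
    union = len(a | b)
    # Two empty masks are treated as non-overlapping (assumption).
    return len(a & b) / union if union else 0.0


def tracklet_iou(t1: Tracklet, t2: Tracklet) -> float:
    """Assumed overlap measure: mean per-frame IoU, skipping frames
    in which both tracklets are empty."""
    ious = [mask_iou(a, b) for a, b in zip(t1, t2) if a or b]
    return sum(ious) / len(ious) if ious else 0.0


def tracklet_nms(
    tracklets: Sequence[Tracklet],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> List[int]:
    """Greedy NMS over tracklets; returns kept indices, best score first."""
    order = sorted(range(len(tracklets)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a tracklet only if it does not overlap too much with any
        # higher-scoring tracklet that was already kept.
        if all(tracklet_iou(tracklets[i], tracklets[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept
```

The greedy form mirrors standard box-level NMS, with the overlap computed over whole tracklets rather than single detections. The surviving tracklets would then be passed to the tracklet-language grounding stage.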
