Full-Duplex Strategy for Video Object Segmentation

Ge-Peng Ji, Deng-Ping Fan, Keren Fu, Zhe Wu, Jianbing Shen, Ling Shao

TL;DR

FSNet proposes a full-duplex approach to video object segmentation by enabling bidirectional cross-modal interaction between appearance and motion through Relational Cross-Attention Modules (RCAM) and Bidirectional Purification Modules (BPM). The architecture conducts cross-modal feature fusion in the encoder via RCAM and refines features in the decoder with cascaded BPMs, improving robustness to motion and appearance inconsistencies. Empirical results on DAVIS$_{16}$, MCL, FBMS, SegTrack-V2, and DAVSOD$_{19}$ demonstrate state-of-the-art unsupervised VOS and strong V-SOD performance, with notable gains in metrics such as $S_{\alpha}$, $E_{\xi}^{\max}$, and $F_{\beta}^{\max}$, and favorable data efficiency. The work provides a unified, efficient framework for both U-VOS and V-SOD, with practical inference speed and publicly available code.
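
For readers unfamiliar with the last of these metrics, the maximum F-measure is conventionally computed from per-threshold precision and recall as sketched below. This is the standard salient-object-detection definition, not a formula quoted from this summary; the weight $\beta^2 = 0.3$ is the common convention and is an assumption here, and $t$ is a hypothetical symbol for the binarization threshold.

$$
F_{\beta}(t) = \frac{(1+\beta^{2})\,\mathrm{Precision}(t)\cdot\mathrm{Recall}(t)}{\beta^{2}\,\mathrm{Precision}(t)+\mathrm{Recall}(t)},
\qquad
F_{\beta}^{\max} = \max_{t}\, F_{\beta}(t)
$$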

Abstract

Previous video object segmentation approaches mainly focus on simplex solutions between appearance and motion, limiting the efficiency of feature collaboration among and across these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue by considering a better mutual restraint scheme between motion and appearance when exploiting the cross-modal features in the fusion and decoding stages. Specifically, we introduce the relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update the inconsistent features from the spatial-temporal embeddings, we adopt the bidirectional purification module (BPM) after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur, occlusion) and achieves favourable performance against existing cutting-edge methods in both the video object segmentation and video salient object detection tasks. The project is publicly available at: https://dpfan.net/FSNet.

Paper Structure

This paper contains 43 sections, 10 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Comparison between three strategies for embedding appearance and motion patterns before the fusion and decoding stage. (a) Direction-independent strategy [jain2017fusionseg] without information transmission, (b) simplex strategy [zhou2020motion_attentive] with only unidirectional information transmission, e.g., using motion to guide appearance or vice versa, and (c) our full-duplex strategy with simultaneous bidirectional information transmission. This paper mainly focuses on discussing directional modelling (b & c) in the deep learning era.
  • Figure 2: Visual comparison between the simplex strategy (i.e., (a) appearance-refined motion and (b) motion-refined appearance) and our full-duplex strategy under our framework. In contrast to the simplex variants, our FSNet offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of the full-duplex strategy, thus providing more accurate structure details and alleviating the short-term feature drifting issue [yang2019anchor].
  • Figure 3: The architecture of our FSNet for video object segmentation. The Relational Cross-Attention Module (RCAM) abstracts more discriminative representations between the motion and appearance cues using the full-duplex strategy. Then four Bidirectional Purification Modules (BPM) are stacked to further re-calibrate inconsistencies between the motion and appearance features. Finally, we utilize the decoder to generate our prediction.
  • Figure 4: Illustration of our Relational Cross-Attention Module (RCAM) with a simplex (a & b) and full-duplex (c) strategy.
  • Figure 5: Illustration of our Bidirectional Purification Module (BPM) with a simplex and full-duplex strategy.
  • ...and 4 more figures
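
The three strategies contrasted in Figure 1 can be sketched in a few lines of dependency-free Python. This is an illustrative toy, not the paper's RCAM: it swaps in a simple channel gate (global average pooling followed by a sigmoid) for the relational cross-attention, and every function name here is hypothetical. A feature map is assumed to be a list of channels, each a flat list of spatial values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(feat):
    """One gate in (0, 1) per channel, from global average pooling."""
    return [sigmoid(sum(ch) / len(ch)) for ch in feat]


def modulate(feat, gate):
    """Scale each channel by its gate and add a residual connection."""
    return [[v + v * g for v in ch] for ch, g in zip(feat, gate)]


def direction_independent(app, mot):
    # (a) No information is exchanged between the two cues.
    return app, mot


def simplex(app, mot):
    # (b) One-way transmission: motion guides appearance only.
    return modulate(app, channel_gate(mot)), mot


def full_duplex(app, mot):
    # (c) Both gates are computed from the *input* features, then each
    # cue is refined by the other simultaneously.
    gate_from_app = channel_gate(app)
    gate_from_mot = channel_gate(mot)
    return modulate(app, gate_from_mot), modulate(mot, gate_from_app)


if __name__ == "__main__":
    appearance = [[1.0, 1.0], [2.0, 2.0]]  # 2 channels, 2 pixels each
    motion = [[0.0, 0.0], [4.0, 4.0]]
    for strategy in (direction_independent, simplex, full_duplex):
        print(strategy.__name__, strategy(appearance, motion))
```

In the simplex variant the motion branch passes through unchanged, whereas the full-duplex variant updates both branches from each other's statistics, which is the property the figure caption emphasizes.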