Space-time Reinforcement Network for Video Object Segmentation

Yadang Chen, Wentao Zhu, Zhi-Xin Yang, Enhua Wu

TL;DR

This work tackles two core problems in memory-based semi-supervised video object segmentation: space-time coherence that breaks down under occlusion or fast motion, and undesired pixel-level mismatches caused by noise and distractors. It introduces SRNet, which combines a Feature Alignment Module that generates an auxiliary frame as a short-temporal reference with a Prototype Transformer Module that enables prototype-level matching via iterative cross-attention, controlled by position embeddings. The approach achieves strong performance on DAVIS 2017 (J&F 86.4%) and a competitive result on YouTube-VOS 2018 (85.0%), while running at real-time inference speeds of 32+ FPS. By alleviating both coherence loss and distractor sensitivity, SRNet offers robust, fast VOS suitable for practical applications in autonomous systems and video editing.

Abstract

Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching against memory frames. Although these methods perform well, they suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames; 2) pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first propose generating an auxiliary frame between adjacent frames, which serves as an implicit short-temporal reference for the query frame. Next, we learn a prototype for each video object so that prototype-level matching can be performed between the query and memory. Experiments demonstrate that our network outperforms the state-of-the-art method on DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive 85.0% on YouTube-VOS 2018. In addition, our network exhibits a high inference speed of 32+ FPS.
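To make the space-time matching step concrete, here is a minimal NumPy sketch of a pixel-level memory readout as used in memory-based VOS networks in general. It is not taken from the SRNet paper: the scaled dot-product affinity, the softmax over memory locations, and the names `memory_readout`, `query_key`, `memory_key`, and `memory_value` are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_readout(query_key, memory_key, memory_value):
    """Pixel-level space-time matching (illustrative, not SRNet's exact form).

    query_key:    (C, HW)   key features of the query frame
    memory_key:   (C, THW)  key features of T memory frames
    memory_value: (D, THW)  value features (mask information) of the memory
    returns:      (D, HW)   value features read out for every query pixel
    """
    c = query_key.shape[0]
    # Affinity between every memory pixel and every query pixel.
    affinity = memory_key.T @ query_key / np.sqrt(c)   # (THW, HW)
    # Assumption: softmax over memory locations, so each query pixel
    # takes a convex combination of memory values.
    weights = softmax(affinity, axis=0)                # (THW, HW)
    return memory_value @ weights                      # (D, HW)

rng = np.random.default_rng(0)
C, D, T, H, W = 8, 16, 3, 4, 5
qk = rng.standard_normal((C, H * W))
mk = rng.standard_normal((C, T * H * W))
mv = rng.standard_normal((D, T * H * W))
readout = memory_readout(qk, mk, mv)
print(readout.shape)
```

Because every query pixel is matched against every memory pixel independently, a single noisy or distractor pixel in memory can receive a high weight, which is the mismatch problem described in issue 2).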

Paper Structure

This paper contains 12 sections, 10 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: (a) t-SNE visualization of the difference between frames. Left: the feature maps of the query and adjacent frames. Right: the feature maps of the query and auxiliary frames. Our proposed auxiliary frame is more consistent with the query than the adjacent frame is. (b) Comparison of pixel-level matching (top) and prototype-level matching (bottom). Orange arrows indicate wrong matches. We propose prototype-level matching to reduce undesired mismatching.
  • Figure 2: An overview of SRNet. We propose a Feature Alignment Module (FAM) for generating an auxiliary frame to obtain the local feature and a Prototype Transformer Module (PTM) to implement prototype-level matching.
  • Figure 3: Implementation of Feature Alignment Module.
  • Figure 4: Implementation of Prototype Transformer Module.
  • Figure 5: Qualitative comparison of SRNet, STCN [9], and XMem [10] on the YouTube-VOS 2018 validation set and the DAVIS 2017 validation set.
  • ...and 5 more figures
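
The contrast between pixel-level and prototype-level matching in Figure 1(b) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Prototype Transformer Module itself: prototypes are initialized by masked average pooling, refined by a few rounds of plain cross-attention with a residual update, and query pixels are assigned by cosine similarity. Learned projections, normalization layers, and the position embeddings mentioned in the summary are omitted, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_prototypes(features, masks):
    """Masked average pooling (assumed initialization).
    features: (N, C) pixel features, masks: (K, N) object masks -> (K, C)."""
    weights = masks / np.maximum(masks.sum(axis=1, keepdims=True), 1e-6)
    return weights @ features

def refine_prototypes(prototypes, features, num_iters=3):
    """Iterative cross-attention: prototypes act as queries over pixel features."""
    c = features.shape[1]
    for _ in range(num_iters):
        attn = softmax(prototypes @ features.T / np.sqrt(c), axis=1)  # (K, N)
        prototypes = prototypes + attn @ features  # residual update (assumption)
    return prototypes

def match_to_prototypes(query_features, prototypes):
    """Assign each query pixel to the prototype with highest cosine similarity."""
    q = query_features / np.linalg.norm(query_features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (q @ p.T).argmax(axis=1)

# Toy memory: three pixels per object, each object clustered around one axis.
memory_features = np.array([
    [3.0, 0.1, 0.0, 0.0], [2.9, 0.0, 0.1, 0.0], [3.1, 0.0, 0.0, 0.1],
    [0.1, 3.0, 0.0, 0.0], [0.0, 2.9, 0.1, 0.0], [0.0, 3.1, 0.0, 0.1],
])
masks = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)
query_features = np.array([[2.0, 0.2, 0.0, 0.0],
                           [0.3, 2.5, 0.0, 0.0],
                           [1.5, 0.0, 0.2, 0.0]])

prototypes = refine_prototypes(init_prototypes(memory_features, masks),
                               memory_features)
labels = match_to_prototypes(query_features, prototypes)
print(labels)  # [0 1 0]
```

Each query pixel is compared with K object prototypes rather than with every memory pixel, so an isolated noisy memory pixel has little influence on the assignment.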