Space-time Reinforcement Network for Video Object Segmentation
Yadang Chen, Wentao Zhu, Zhi-Xin Yang, Enhua Wu
TL;DR
This work tackles two core problems in memory-based semi-supervised video object segmentation: space-time coherence between adjacent frames breaks down under occlusion and fast motion, and pixel-level matching produces mismatches caused by noise and distractors. It introduces SRNet, which combines a Feature Alignment Module that generates an auxiliary frame as a short-temporal reference with a Prototype Transformer Module that enables prototype-level matching via iterative cross-attention, controlled by position embeddings. The approach achieves a J&F score of 86.4% on DAVIS 2017 and a competitive 85.0% on YouTube-VOS 2018, with an inference speed of 32+ FPS. By alleviating both coherence loss and distractor sensitivity, SRNet offers robust, fast VOS suitable for practical applications in autonomous systems and video editing.
Abstract
Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching against memory frames. Although these methods perform well, they suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames; 2) pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first propose generating an auxiliary frame between adjacent frames, which serves as an implicit short-temporal reference for the query frame. Next, we learn a prototype for each video object, so that prototype-level matching can be performed between the query and memory. Experiments demonstrate that our network outperforms the state-of-the-art method on DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive result of 85.0% on YouTube-VOS 2018. In addition, our network achieves a high inference speed of 32+ FPS.
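The difference between pixel-level and prototype-level matching can be illustrated with a minimal NumPy sketch. It is not the SRNet implementation: the function names, tensor shapes, and toy data are hypothetical, and the prototypes here come from simple masked average pooling rather than the iterative cross-attention of the Prototype Transformer Module. The point is only that pixel-level readout compares every query pixel with every memory pixel, whereas prototype-level matching compares each query pixel with one vector per object.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixel_level_readout(query_key, memory_key, memory_value):
    """Generic memory readout: every query pixel attends over all memory pixels.

    query_key: (Nq, C), memory_key: (Nm, C), memory_value: (Nm, D).
    A single noisy memory pixel with a similar key can pull weight toward
    the wrong object, which is the mismatch problem the abstract describes.
    """
    affinity = query_key @ memory_key.T / np.sqrt(query_key.shape[1])
    weights = softmax(affinity, axis=1)  # (Nq, Nm), each row sums to 1
    return weights @ memory_value        # (Nq, D)


def masked_average_prototypes(memory_feat, memory_masks, eps=1e-6):
    """Assumption: one prototype per object via masked average pooling.

    memory_feat: (Nm, C), memory_masks: (K, Nm). Returns (K, C).
    This stands in for the learned prototypes of the paper.
    """
    weights = memory_masks / (memory_masks.sum(axis=1, keepdims=True) + eps)
    return weights @ memory_feat


def prototype_level_match(query_feat, prototypes, eps=1e-6):
    """Cosine similarity between each query pixel and each object prototype.

    query_feat: (Nq, C), prototypes: (K, C). Returns (Nq, K).
    """
    q = query_feat / (np.linalg.norm(query_feat, axis=1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return q @ p.T


# Toy data: four memory pixels, two objects, two query pixels.
memory_feat = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
memory_masks = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
query_feat = np.array([[0.8, 0.2], [0.2, 0.8]])

prototypes = masked_average_prototypes(memory_feat, memory_masks)
labels = prototype_level_match(query_feat, prototypes).argmax(axis=1)
readout = pixel_level_readout(query_feat, memory_feat, memory_masks.T)
```

In this sketch the prototype path scores each query pixel against K object vectors instead of Nm memory pixels, so an isolated noisy memory pixel is averaged into its object's prototype rather than matched directly.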
