Temporally Consistent Referring Video Object Segmentation with Hybrid Memory
Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Mubarak Shah, Ajmal Mian
TL;DR
This work addresses temporal inconsistency in Referring Video Object Segmentation (R-VOS) by introducing HTR, an end-to-end paradigm that explicitly models temporal instance consistency alongside referring segmentation. It leverages a Hybrid Memory composed of local memory for fine-grained propagation and global tokens for robust global context, enabling inter-frame collaboration that propagates high-confidence reference information to remaining frames. A selective referring segmentation mechanism identifies frames with reliable reference masks, while the memory-based propagation maintains coherent segmentations across time and improves mask quality. A new Mask Consistency Score (MCS) quantifies temporal stability, and HTR achieves top results on Ref-YouTube-VOS and Ref-DAVIS17, with strong performance on A2D-Sentences, JHMDB-Sentences, and MeViS, while maintaining efficient runtimes.
Abstract
Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Frames with automatically generated high-quality reference masks serve as references: their features are propagated to the remaining frames through multi-granularity association, yielding temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.
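The select-then-propagate idea described above can be illustrated with a toy sketch. It is not the HTR implementation: the function names, the confidence threshold, and the nearest-neighbour matching rule are all assumptions made for illustration, and the hybrid memory (local memory plus global tokens) is replaced by a flat list of reference-pixel features. It only shows the control flow: keep frames whose reference masks look reliable, then relabel the remaining frames by matching their features against that memory.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_reference_frames(confidences, threshold=0.8):
    """Return indices of frames whose mask confidence reaches the threshold.

    Hypothetical rule: if no frame qualifies, fall back to the single
    most confident frame so the memory is never empty.
    """
    refs = [i for i, c in enumerate(confidences) if c >= threshold]
    if not refs:
        refs = [max(range(len(confidences)), key=confidences.__getitem__)]
    return refs


def propagate_masks(features, masks, ref_ids):
    """Relabel non-reference frames from a memory of reference pixels.

    features[t][p] is the feature vector of pixel p in frame t;
    masks[t][p] is its initial binary label. Each pixel of a
    non-reference frame copies the label of its most similar memory entry
    (a simplified stand-in for learned memory matching).
    """
    memory = [
        (features[t][p], masks[t][p])
        for t in ref_ids
        for p in range(len(features[t]))
    ]
    refined = []
    for t, frame in enumerate(features):
        if t in ref_ids:
            refined.append(list(masks[t]))  # reference masks are kept as-is
            continue
        refined.append(
            [max(memory, key=lambda entry: cosine(entry[0], f))[1] for f in frame]
        )
    return refined


# Toy video: 3 frames, 2 pixels each; the object has features near [1, 0].
confidences = [0.95, 0.40, 0.90]
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.2]],
    [[0.0, 1.0], [1.0, 0.1]],
]
masks = [[1, 0], [1, 0], [0, 1]]  # frame 1 starts with an inconsistent mask

ref_ids = select_reference_frames(confidences)
refined = propagate_masks(features, masks, ref_ids)
print(ref_ids, refined)
```

In this toy run the low-confidence middle frame is relabelled from the two reference frames, so its mask ends up agreeing with the object's appearance in the rest of the clip.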
