Temporally Consistent Referring Video Object Segmentation with Hybrid Memory

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Mubarak Shah, Ajmal Mian

TL;DR

This work addresses temporal inconsistency in Referring Video Object Segmentation (R-VOS) by introducing HTR, an end-to-end paradigm that explicitly models temporal instance consistency alongside referring segmentation. It leverages a Hybrid Memory composed of local memory for fine-grained propagation and global tokens for robust global context, enabling inter-frame collaboration that propagates high-confidence reference information to remaining frames. A selective referring segmentation mechanism identifies frames with reliable reference masks, while the memory-based propagation maintains coherent segmentations across time and improves mask quality. A new Mask Consistency Score (MCS) quantifies temporal stability, and HTR achieves top results on Ref-YouTube-VOS and Ref-DAVIS17, with strong performance on A2D-Sentences, JHMDB-Sentences, and MeViS, while maintaining efficient runtimes.

Abstract

Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.
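The abstract names the Mask Consistency Score but does not define it in this excerpt. Purely as an illustration, a temporal-consistency metric of this kind could be computed as the fraction of videos in which every frame's IoU against the ground truth stays at or above a threshold. The function names, the 0.5 threshold, and the flat-list mask representation below are assumptions made for this sketch and may differ from the paper's actual MCS definition.

```python
def iou(pred, gt):
    """Intersection-over-union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return 1.0 if union == 0 else inter / union


def consistency_score(videos, threshold=0.5):
    """Fraction of videos whose per-frame IoU never drops below `threshold`.

    `videos` is a list of videos; each video is a list of (pred, gt) pairs.
    Hypothetical stand-in for a temporal-consistency metric such as MCS.
    """
    if not videos:
        return 0.0
    consistent = sum(
        1 for frames in videos
        if all(iou(pred, gt) >= threshold for pred, gt in frames)
    )
    return consistent / len(videos)


# Toy data: video A stays on target, video B loses it in the second frame.
video_a = [([1, 1, 0, 0], [1, 1, 0, 0]), ([1, 1, 1, 0], [1, 1, 0, 0])]
video_b = [([1, 1, 0, 0], [1, 1, 0, 0]), ([0, 0, 0, 1], [1, 1, 0, 0])]
score = consistency_score([video_a, video_b], threshold=0.5)
```

A per-video "all frames pass" criterion penalizes a single lost frame as heavily as many, which is the behavior one would want from a stability measure; a mean-IoU metric would hide such isolated failures.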

Paper Structure (42 sections, 11 equations, 5 figures, 10 tables)

This paper contains 42 sections, 11 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: (a) Our HTR automatically generates reference masks and achieves temporally consistent R-VOS in an end-to-end manner using the robust hybrid memory. (b) The baseline model ReferFormer loses track of the target in some frames (marked by $\XBox$), whereas ours maintains temporal consistency in segmenting the correct object.
  • Figure 2: Detailed architecture of HTR. (a) The selective referring process predicts the score $\mathcal{S}$ and conditional kernels $\theta_{ck}$ for each frame to selectively segment frames with high scores. The masks and visual features of these selected reference frames, together with only the visual features of the remaining target frames, are passed to (b) the inter-frame collaboration module. This module encodes reference frames to construct the hybrid memory and aggregates memory features based on node-node (local memory) and node-object (global token) affinity to segment each pixel node in target frames. $\mathcal{F}^{s}$/$\mathcal{F}^{w}$: sentence/word features.
  • Figure 3: Visualization of the propagated features using our hybrid memory and a standard local memory (STCN). Our memory demonstrates robust feature propagation in challenging scenarios.
  • Figure 4: Qualitative results on Ref-YouTube-VOS and MeViS. (a) Our HTR predicts temporally consistent results compared to ReferFormer and SgMg. (b) HTR can handle appearance and motion expressions in various challenging scenarios, including fast motion, objects with similar appearance, occlusion, and small objects. (c) HTR fails to improve mask quality without good reference masks.
  • Figure 5: Visualization of the affinity between the vision-mask joint representations of target frames and the global foreground representation in the hybrid memory. Our feature aggregation extracts robust global representations to localize the correct targets.
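
The pipeline described in the Figure 2 caption can be sketched roughly as follows: frames are split into reference and target frames by score, and each target pixel node then reads from both a local memory (node-node affinity) and a set of global tokens (node-object affinity). This is a minimal sketch, not the authors' implementation. The helper names, the 0.8 score threshold, the softmax over dot-product affinities, and the fusion by concatenation are all assumptions chosen for illustration.

```python
import math


def split_frames(scores, threshold=0.8):
    """Split frame indices into reference (high score) and target (the rest)."""
    reference = [i for i, s in enumerate(scores) if s >= threshold]
    target = [i for i, s in enumerate(scores) if s < threshold]
    return reference, target


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def read_memory(query, keys, values):
    """Affinity-weighted average of `values`, using query-key dot products."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def hybrid_readout(query, local_keys, local_values, token_keys, token_values):
    """Concatenate a node-node (local) and a node-object (global) readout."""
    local = read_memory(query, local_keys, local_values)
    global_ctx = read_memory(query, token_keys, token_values)
    return local + global_ctx


# Toy example: three frames, one target pixel node with a 2-d feature.
reference, target = split_frames([0.9, 0.3, 0.85])
query = [1.0, 0.0]
feature = hybrid_readout(
    query,
    local_keys=[[10.0, 0.0], [0.0, 10.0]],   # per-pixel memory nodes
    local_values=[[1.0], [0.0]],
    token_keys=[[1.0, 0.0], [1.0, 0.0]],     # object-level global tokens
    token_values=[[1.0], [0.0]],
)
```

In this toy setup the local readout is dominated by the memory node that matches the query, while the global readout averages over the tokens; a real model would feed the combined feature to a mask decoder.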