Video Object Segmentation with Dynamic Query Modulation

Hantao Zhou; Runze Hu; Xiu Li

TL;DR

This work proposes a query modulation method, termed QMVOS, that summarizes object features into dynamic queries and then treats them as dynamic filters for mask prediction, thereby providing high-level descriptions and object-level perception for the model.

Abstract

By storing intermediate frame segmentations as memory for long-range context modeling, spatial-temporal memory-based methods have recently achieved impressive results in semi-supervised video object segmentation (SVOS). However, these methods face two key limitations: 1) they rely on non-local pixel-level matching to read memory, which yields noisy retrieved features for segmentation; 2) they segment each object independently, without interaction between objects. As a result, memory-based methods struggle when objects look similar and when multiple objects must be segmented. To address these issues, we propose a query modulation method, termed QMVOS. This method summarizes object features into dynamic queries and then treats them as dynamic filters for mask prediction, thereby providing high-level descriptions and object-level perception for the model. Efficient and effective multi-object interactions are realized through inter-query attention. Extensive experiments demonstrate that our method brings significant improvements to the memory-based SVOS method and achieves competitive performance on standard SVOS benchmarks. The code is available at https://github.com/zht8506/QMVOS.
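The three ideas named in the abstract (summarizing object features into queries, letting queries interact, and using queries as dynamic filters) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, masked average pooling is assumed as the summarization step, and the attention uses no learned projections. The paper's SIM and QCIM modules are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def summarize_queries(features, masks, eps=1e-6):
    """Assumed summarization: masked average pooling, one query per object.

    features: (C, H, W) feature map; masks: (N, H, W) soft masks in [0, 1].
    Returns queries of shape (N, C).
    """
    c = features.shape[0]
    flat_feat = features.reshape(c, -1)              # (C, H*W)
    flat_mask = masks.reshape(masks.shape[0], -1)    # (N, H*W)
    weighted_sum = flat_mask @ flat_feat.T           # (N, C)
    return weighted_sum / (flat_mask.sum(axis=1, keepdims=True) + eps)

def inter_query_attention(queries):
    """Single-head self-attention across object queries, with a residual.

    Learned projections are omitted for brevity (an assumption of this sketch).
    """
    c = queries.shape[1]
    attn = softmax(queries @ queries.T / np.sqrt(c), axis=-1)  # (N, N)
    return queries + attn @ queries

def predict_masks(queries, features):
    """Apply each query as a 1x1 dynamic filter over the feature map.

    Returns per-object mask logits of shape (N, H, W).
    """
    return np.einsum("nc,chw->nhw", queries, features)

# Toy example: two objects occupying the left and right halves of the frame.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 8, 8))
masks = np.zeros((2, 8, 8))
masks[0, :, :4] = 1.0
masks[1, :, 4:] = 1.0

queries = inter_query_attention(summarize_queries(features, masks))
logits = predict_masks(queries, features)
print(logits.shape)  # (2, 8, 8)
```

Because each object is reduced to a single vector, interaction between N objects costs only an N x N attention map, which is one plausible reading of why the abstract calls the multi-object interaction efficient.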

Paper Structure (14 sections, 4 equations, 3 figures, 3 tables)

This paper contains 14 sections, 4 equations, 3 figures, 3 tables.

Introduction
Related work
Semi-supervised Video Object Segmentation
Query-based Method
Method
Overview
Scale-aware Interaction Module
Query-Content Interaction Module
Implementation Details
Experiments
Datasets and Evaluation Metrics
Main Results
Ablation Study
Conclusion

Figures (3)

Figure 1: The pipeline of existing memory-based SVOS works (a) and our QMVOS (b). Our method innovatively introduces object queries to VOS, enabling effective object-level perception, multi-object interaction and dynamic prediction.
Figure 2: (a) The main pipeline of our framework, which follows typical memory-based methods and introduces queries to achieve object-level perception. (b) The structure of SIM. It consists of multi-scale fusion and multi-object interaction processes to initialize queries. (c) The structure of QCIM. It is utilized to perform query-content interaction.
Figure 3: Qualitative comparisons of our QMVOS with the SOTA memory-based work, XMem [cheng2022xmem]. Failures are marked with white dashed boxes. Our model outperforms XMem in capturing fine details and in discriminating between similar objects.
