Video Object Segmentation with Dynamic Query Modulation

Hantao Zhou, Runze Hu, Xiu Li

TL;DR

This work proposes a query modulation method, termed QMVOS, that summarizes object features into dynamic queries and then treats them as dynamic filters for mask prediction, thereby providing high-level descriptions and object-level perception for the model.

Abstract

Storing intermediate frame segmentations as memory for long-range context modeling, spatial-temporal memory-based methods have recently showcased impressive results in semi-supervised video object segmentation (SVOS). However, these methods face two key limitations: 1) relying on non-local pixel-level matching to read memory, resulting in noisy retrieved features for segmentation; 2) segmenting each object independently without interaction. These shortcomings make the memory-based methods struggle in similar object and multi-object segmentation. To address these issues, we propose a query modulation method, termed QMVOS. This method summarizes object features into dynamic queries and then treats them as dynamic filters for mask prediction, thereby providing high-level descriptions and object-level perception for the model. Efficient and effective multi-object interactions are realized through inter-query attention. Extensive experiments demonstrate that our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks. The code is available at https://github.com/zht8506/QMVOS.
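The query-modulation idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the linked repository for that); it only assumes one plausible reading of the three steps: masked average pooling to summarize each object into a query, a weight-free single-head self-attention across queries for multi-object interaction, and a dot product between each query and the feature map to act as a dynamic filter. The function names `summarize_queries`, `inter_query_attention`, and `predict_masks` are hypothetical and chosen for this illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def summarize_queries(features, masks, eps=1e-6):
    """Summarize each object into one query by masked average pooling.

    features: (C, H, W) feature map; masks: (N, H, W) soft masks in [0, 1].
    Returns (N, C). Assumption: the paper may use a learned summarizer instead.
    """
    num = np.einsum("nhw,chw->nc", masks, features)
    den = masks.sum(axis=(1, 2))[:, None] + eps
    return num / den

def inter_query_attention(queries):
    """Let the N object queries attend to each other (no learned weights)."""
    scale = np.sqrt(queries.shape[1])
    attn = softmax(queries @ queries.T / scale, axis=-1)  # (N, N)
    return queries + attn @ queries  # residual update

def predict_masks(queries, features):
    """Apply each query as a 1x1 dynamic filter over the feature map."""
    return np.einsum("nc,chw->nhw", queries, features)  # logits, (N, H, W)

# Toy usage: two objects with distinct channel signatures on a 2x4 grid.
features = np.zeros((2, 2, 4))
features[0, :, :2] = 1.0   # object 0 occupies the left half
features[1, :, 2:] = 1.0   # object 1 occupies the right half
masks = np.stack([features[0], features[1]])

queries = inter_query_attention(summarize_queries(features, masks))
logits = predict_masks(queries, features)
labels = logits.argmax(axis=0)  # per-pixel object index
```

In this toy setting each pixel is assigned to the object whose query best matches its features, which is the sense in which a query works as a per-object filter. A real model would use learned projections and a decoder rather than raw dot products.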

Paper Structure (14 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: The pipeline of existing memory-based SVOS works (a) and our QMVOS (b). Our method innovatively introduces object queries to VOS, enabling effective object-level perception, multi-object interaction and dynamic prediction.
  • Figure 2: (a) The main pipeline of our framework, which follows typical memory-based methods and introduces queries to achieve object-level perception. (b) The structure of SIM. It consists of multi-scale fusion and multi-object interaction processes to initialize queries. (c) The structure of QCIM. It is utilized to perform query-content interaction.
  • Figure 3: Qualitative comparisons of our QMVOS with the SOTA memory-based work, XMem [cheng2022xmem]. Failures are marked with white dashed boxes. Our model outperforms XMem in capturing fine details and in discriminating between similar objects.