DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries

Yikang Zhou, Tao Zhang, Shunping Ji, Shuicheng Yan, Xiangtai Li

TL;DR

This work tackles the difficulty of handling newly emerging and disappearing objects in video segmentation with query-based methods, which suffer from a large feature transition gap between background anchors and foreground targets. It introduces Dynamic Anchor Queries (DAQ) to generate emergence and disappearance anchors from candidate frame features, and Emergence and Disappearance Simulation (EDS) to amplify training examples without extra cost, integrating them into the DVIS framework to produce DVIS-DAQ. Ablation studies show that DAQ reduces the feature transition gap and that EDS is crucial to fully unleash DAQ's potential, yielding robust handling of emergence/disappearance. Across five mainstream benchmarks, DVIS-DAQ achieves state-of-the-art results, demonstrating strong practical impact for long videos and real-world scenes.

Abstract

Modern video segmentation methods adopt object queries to perform inter-frame association and demonstrate satisfactory performance in tracking continuously appearing objects despite large-scale motion and transient occlusion. However, they all underperform on newly emerging and disappearing objects that are common in the real world because they attempt to model object emergence and disappearance through feature transitions between background and foreground queries that have significant feature gaps. We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries by dynamically generating anchor queries based on the features of potential candidates. Furthermore, we introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost. Finally, we combine our proposed DAQ and EDS with DVIS to obtain DVIS-DAQ. Extensive experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks. Code and models are available at https://github.com/SkyworkAI/DAQ-VS.
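The transition-gap argument in the abstract can be pictured with a toy calculation. The sketch below is not the authors' implementation: the 3-dimensional vectors, the use of Euclidean distance as the "gap", and the mean of candidate features as a stand-in for anchor generation are all assumptions made purely for illustration. It only shows why an anchor derived from candidate features may start closer to the target query than a fixed background anchor.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical 3-d features; real queries are high-dimensional and learned.
background_anchor = [0.0, 0.0, 0.0]       # fixed anchor, unrelated to any object
candidate_features = [[0.9, 1.1, 1.0],    # features taken from a candidate object
                      [1.1, 0.9, 1.0]]
target_query = [1.0, 1.0, 1.2]            # query the tracker should end up with

# Assumed stand-in for dynamic anchor generation: average the candidate features.
dynamic_anchor = mean_vector(candidate_features)

gap_static = l2(background_anchor, target_query)
gap_dynamic = l2(dynamic_anchor, target_query)
print(f"static gap:  {gap_static:.3f}")
print(f"dynamic gap: {gap_dynamic:.3f}")
```

In this toy setting the candidate-derived anchor starts much nearer to the target than the background anchor does, which is the intuition behind shortening the transition gap.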

Paper Structure

This paper contains 17 sections, 6 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Preliminary experiments and our results. We analyze how well current query-based video segmentation methods handle the emergence and disappearance of objects within videos. Left: a scatter plot of the recall ratios for newly emerging and disappearing video objects on a subset of the BDD100K dataset [bdd100k]. Middle: a visual comparison, in which the top two rows show the outputs of DVIS [zhang2023dvis] and GenVIS [heo2023generalized], respectively, and the bottom row shows the output of our method. The poor performance of previous methods on detecting new emergences and filtering out disappearances is highlighted with a red circle. Right: our method outperforms the current SOTA on mainstream video segmentation datasets.
  • Figure 2: The process of handling newly emerging and disappearing objects in different methods. Unlike previous approaches that treat appearing, disappearing, and tracked objects equally, our method dynamically generates anchor queries for emergence and disappearance based on candidate objects' features to effectively shorten the transition gap.
  • Figure 3: Generation of dynamic anchor queries. $F^{T}$ represents the image feature of the $T$-th frame. Symbols Emg and Dis denote emergence and disappearance. Symbols feat and pos indicate feature and positional embedding, respectively.
  • Figure 4: Tracker with dynamic anchor queries. $Q_{seg}$ represents the query output of the segmenter. CTQ and DAQ stand for continuously tracked queries and dynamic anchor queries, respectively. Symbols Emg and Dis denote emergence and disappearance.
  • Figure 5: The pipelines of emergence and disappearance simulation. CTQ, DAQ$_{Emg}$, DAQ$_{Dis}$, and $Q_{seg}$ are represented by rectangles, circles, diamonds, and heptagons, respectively. White indicates a background query transformed from an anchor query. The red rectangles highlight the differences between the pipelines with and without simulation.
  • ...and 3 more figures
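
The query-level simulation mentioned in the abstract and in the Figure 5 caption can be sketched roughly as follows. This summary does not give the exact procedure, so everything here is an assumption: the function name `simulate_emergence_disappearance`, the probabilities `p_emg` and `p_dis`, and the idea of dropping a continuously tracked query (so the object looks newly emerged) or dropping a segmenter query (so the object looks as if it has disappeared) are hypothetical. The sketch only illustrates why such a simulation needs no extra frames or annotations, since it edits the query sets directly.

```python
import random

def simulate_emergence_disappearance(tracked, current, p_emg=0.2, p_dis=0.2, rng=None):
    """Toy query-level simulation (hypothetical, not the authors' code).

    tracked: dict of object id -> feature carried over from earlier frames.
    current: dict of object id -> feature from the segmenter on this frame.
    Returns edited copies of both dicts plus the ids chosen for each case.
    """
    rng = rng or random.Random()
    tracked_out = dict(tracked)
    current_out = dict(current)
    emerged, disappeared = [], []
    # Only objects present on both sides can be turned into a simulated case.
    for obj_id in sorted(set(tracked) & set(current)):
        r = rng.random()
        if r < p_emg:
            # Forget the tracked query: the object now looks newly emerged.
            del tracked_out[obj_id]
            emerged.append(obj_id)
        elif r < p_emg + p_dis:
            # Hide the segmenter query: the object now looks as if it vanished.
            del current_out[obj_id]
            disappeared.append(obj_id)
    return tracked_out, current_out, emerged, disappeared

tracked = {1: [0.1], 2: [0.2]}
current = {1: [0.1], 2: [0.2], 3: [0.3]}
print(simulate_emergence_disappearance(tracked, current, rng=random.Random(0)))
```

Because the inputs are copied before editing, the original query sets remain available for computing the usual training targets alongside the simulated ones.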