DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation

Volodymyr Fedynyak; Yaroslav Romanus; Bohdan Hlovatskyi; Bohdan Sydor; Oles Dobosevych; Igor Babin; Roman Riazantsev

TL;DR

DeVOS introduces Adaptive Deformable Video Attention (ADVA) for video object segmentation (VOS). ADVA decouples motion from semantics: optical flow supplies motion priors via QK-flow, and query-specific deformable offsets refine the matching region, with multi-scale matching over both short-term and long-term memory. Combining memory-based long-term matching with motion-aware short-term propagation gives strong temporal consistency and robustness to appearance changes. DeVOS reports top-rank results on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%) with consistent run-time speed and stable memory consumption. It uses a ViT-based or ResNet backbone and an efficient flow representation, and the experiments, including ablations and MOSE 2023 training, examine the contributions of multi-scale matching, flow guidance, and backbone choice.

Abstract

Recent works on Video Object Segmentation have achieved remarkable results by matching dense semantic and instance-level features between the current and previous frames for long-term propagation. Nevertheless, global feature matching ignores scene motion context and therefore fails to satisfy temporal consistency. Although some methods introduce a local matching branch to achieve smooth propagation, they fail to model complex appearance changes because of the constraints of the local window. In this paper, we present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation, resulting in stable long-term modeling and strong temporal consistency. For short-term local propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable Video Attention), which adapts the similarity search region to query-specific semantic features and thereby ensures robust tracking of complex shape and scale changes. DeVOS employs optical flow to obtain scene motion features, which are then injected into the deformable attention as strong priors for the learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%), while featuring consistent run-time speed and stable memory consumption.
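The short-term propagation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single head and a single scale, takes the semantic offsets as an input array instead of predicting them with a learned layer, and uses hypothetical names (bilinear_sample, flow_guided_deformable_attention) and array layouts chosen only for illustration. Each query location is first shifted by the optical flow, then K extra offsets select sampling points in the previous frame, and the sampled values are combined by softmax attention over query-key correlation.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample feat (H, W, C) at fractional coordinates; out-of-range points clamp to the border."""
    H, W, _ = feat.shape
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[..., None]
    wx = (xs - x0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def flow_guided_deformable_attention(query, prev_key, prev_value, flow, offsets):
    """
    query:      (H, W, C) current-frame features
    prev_key:   (H, W, C) previous-frame features
    prev_value: (H, W, D) previous-frame values (e.g. mask embeddings)
    flow:       (H, W, 2) displacement (dy, dx) from current to previous frame
    offsets:    (H, W, K, 2) per-query offsets (dy, dx); an input here (assumption)
    returns:    (H, W, D) propagated values
    """
    H, W, C = query.shape
    grid_y, grid_x = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Step 1: the flow acts as a prior that moves each reference point.
    ref_y = grid_y + flow[..., 0]
    ref_x = grid_x + flow[..., 1]
    # Step 2: K sampling points around the flow-displaced reference.
    ys = ref_y[..., None] + offsets[..., 0]
    xs = ref_x[..., None] + offsets[..., 1]
    keys = bilinear_sample(prev_key, ys, xs)      # (H, W, K, C)
    values = bilinear_sample(prev_value, ys, xs)  # (H, W, K, D)
    # Softmax attention over the K sampled points only.
    logits = np.einsum("hwc,hwkc->hwk", query, keys) / np.sqrt(C)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("hwk,hwkd->hwd", weights, values)
```

With zero offsets and K = 1 the function reduces to warping the previous-frame values by the flow, which is the motion-only baseline that the semantic offsets are meant to refine. Because attention is restricted to K sampled points per query, the cost grows with K and not with the number of pixels in the previous frame.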

Paper Structure (28 sections, 6 equations, 4 figures, 3 tables)

Introduction
Related Work
Optical Flow Estimation
Video Object Segmentation
Vision Transformers and Deformable Attention
Method
Adaptive Deformable Video Attention
QK-flow
Multi-scale matching
Network details
Encoder & Decoder
Object masks
Flow representation
Experiments
Implementation details
...and 13 more sections

Figures (4)

Figure 1: The process of matching features between the current and preceding frames is divided into two steps: flow-based displacement adjustment and semantics-driven deformable attention.
Figure 2: Overview of the DeVOS architecture. The current frame is processed by the encoder and a self-attention block. Optical flow between the current and previous frames is then computed and used by the adaptive deformable video attention between current-frame and previous-frame features. Information from a memory bank holding long-term memory frames is incorporated through a long-term multi-scale deformable attention block.
Figure 3: Adaptive deformable video attention. The multi-scale flow-based feature matching consists of two steps: offset prediction for feature alignment, and multi-head attention. Two types of offsets are used: flow-based offsets for motion compensation and semantic-based offsets for extracting previous-frame image and mask embeddings. Multi-head attention combines the previous-frame mask and image embeddings according to the correlation between the sampled previous-frame features and the query image embedding vector.
Figure 4: Qualitative comparison between DeVOS and several state-of-the-art VOS methods. Best viewed zoomed in. ISVOS [wang2022look] is not included because no source code is available. All methods are evaluated on DAVIS 2017 val sequences at 480p.
