Spatial-Temporal Multi-level Association for Video Object Segmentation

Deshui Miao, Xin Li, Zhenyu He, Huchuan Lu, Ming-Hsuan Yang

TL;DR

The paper tackles semi-supervised video object segmentation, where existing methods do not achieve sufficient target interaction and efficient parallel processing at the same time. It introduces the Spatial-Temporal Multi-Level Association (STMA) framework, which consists of a spatial-temporal multi-level feature association module (STML), a spatial-temporal memory bank, and an ID association pipeline, and is designed to learn dynamic, target-aware features. The STML module decouples attention into object self-attention, reference object enhancement, and test-reference correlation, while the memory bank supports long-term ID tracking. The method reports strong performance on DAVIS 2016/2017 and YouTube-VOS 2018/2019, including competitive results without pretraining, and is robust on small targets and long sequences. The authors state that code and trained models will be released to support reproducibility and further research.

Abstract

Existing semi-supervised video object segmentation methods either focus on temporal feature matching or spatial-temporal feature modeling. However, they do not address the issues of sufficient target interaction and efficient parallel processing simultaneously, thereby constraining the learning of dynamic, target-aware features. To tackle these limitations, this paper proposes a spatial-temporal multi-level association framework, which jointly associates reference frame, test frame, and object features to achieve sufficient interaction and parallel target ID association with a spatial-temporal memory bank for efficient video object segmentation. Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features, which formulates feature extraction and interaction as the efficient operations of object self-attention, reference object enhancement, and test reference correlation. In addition, we propose a spatial-temporal memory to assist feature association and temporal ID assignment and correlation. We evaluate the proposed method by conducting extensive experiments on numerous video object segmentation datasets, including DAVIS 2016/2017 val, DAVIS 2017 test-dev, and YouTube-VOS 2018/2019 val. The favorable performance against the state-of-the-art methods demonstrates the effectiveness of our approach. All source code and trained models will be made publicly available.
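
The three operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, residual connections, or normalization, and the names `attend` and `stml_layer` are hypothetical. It only shows which token sets attend to which, following the description of object self-attention, reference object enhancement, and test-reference correlation.

```python
import math


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Both arguments are lists of equal-length float vectors. The same
    vectors serve as keys and values (a simplifying assumption).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                        for j in range(dim)])
    return outputs


def stml_layer(obj_tokens, ref_tokens, test_tokens):
    """One hypothetical association step over three token groups.

    Assumption: all three attentions read the layer's input tokens,
    so they could run in parallel.
    """
    # Object self-attention: object tokens attend only to themselves.
    obj_out = attend(obj_tokens, obj_tokens)
    # Reference object enhancement: reference tokens attend to
    # themselves and to the object tokens.
    ref_out = attend(ref_tokens, ref_tokens + obj_tokens)
    # Test-reference correlation: test tokens attend to themselves
    # and to the reference tokens.
    test_out = attend(test_tokens, test_tokens + ref_tokens)
    return obj_out, ref_out, test_out


if __name__ == "__main__":
    obj = [[1.0, 0.0], [0.0, 1.0]]               # two object tokens
    ref = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]   # three reference tokens
    test = [[2.0, 0.0]]                          # one test-frame token
    for name, out in zip(("object", "reference", "test"),
                         stml_layer(obj, ref, test)):
        print(name, out)
```

Restricting each token group to a fixed set of keys, instead of running full attention over the concatenation of all tokens, is what the sketch means by decoupled attention. How the actual model projects, stacks, and normalizes these operations is not specified in this summary.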

Paper Structure (13 sections, 5 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Performance on challenging VOS scenarios with tiny objects and long-term changes. XMem, DeAOT, and SimVOS do not work well in this scenario, whereas our method accurately predicts the mask of the 'baby monkey' (marked by the red box) across frames.
  • Figure 2: Overall framework. It consists of a spatial-temporal multi-level (STML) feature association part, a prediction module, and a spatial-temporal memory. The STML module conducts simultaneous feature extraction and correlation. The spatial-temporal memory not only provides object features and reference frames for STML but also offers temporal feature information for ID association.
  • Figure 3: Illustration of the proposed spatial-temporal correlation, using two reference frames as an example. The object features conduct self-attention, and the reference features perform attention both with themselves and with the object features. The target feature undergoes attention with both itself and the reference features simultaneously.
  • Figure 4: Visualized results on sequences with small and faint objects. It shows that our method generates finer masks compared to the state-of-the-art methods.
  • Figure 5: Visualized results on sequences with complicated ID connections. The proposed method performs well in tracking the tennis rackets, which demonstrates excellent performance in terms of ID propagation.
  • ...and 1 more figure