Memory Matching is not Enough: Jointly Improving Memory Matching and Decoding for Video Object Segmentation

Jintu Zheng; Yun Liang; Yuqing Zhang; Wanchao Su

TL;DR

This paper proposes an approach that jointly improves the matching and decoding stages to alleviate false matching, and implements a compensatory mechanism that recovers essential information missed at the matching stage.

Abstract

Memory-based video object segmentation methods model multiple objects over long temporal-spatial spans by establishing a memory bank, and achieve remarkable performance. However, they struggle to overcome false matching and are prone to losing critical information, resulting in confusion among different objects. In this paper, we propose an effective approach that jointly improves the matching and decoding stages to alleviate the false matching issue. For the memory matching stage, we present a cost-aware mechanism that suppresses slight errors in short-term memory, and a shunted cross-scale matching for long-term memory that establishes wide-field matching spaces for various object scales. For the readout decoding stage, we implement a compensatory mechanism that recovers essential information missed at the matching stage. Our approach achieves outstanding performance on several popular benchmarks: 92.4% and 88.1% on DAVIS 2016 and 2017 Val, 83.9% on DAVIS 2017 Test, and 84.8% and 84.6% on YouTubeVOS 2018 and 2019 Val.

Paper Structure (13 sections, 10 equations, 5 figures, 5 tables)

This paper contains 13 sections, 10 equations, 5 figures, 5 tables.

Introduction
Related Work
Methodology
Proposal Overview
Cross-Scale in Long-Term Matching
Cost-Aware Matching
Compensatory Decoding
Experiments
Implementation Details
Datasets and Metrics
Compare with the State-of-the-Art Methods
Ablation Studies
Conclusion

Figures (5)

Figure 1: (a) Comparison on a representative video clip from DAVIS 2017 Test. AOT and XMem (two state-of-the-art matching-based VOS models) exhibit false matching errors, whereas our method produces more accurate masks. (b-c) A simplified pipeline comparison between our method and previous matching-based methods. Previous matching-based methods do not consider improving the two stages jointly.
Figure 2: (a) Pipeline of our proposal, which improves the memory matching stage by cost-aware and cross-scale matching, and improves the decoding stage by compensatory decoding. (b-c) Illustration of cross-scale and cost-aware matching.
Figure 3: Illustration of compensatory decoding, which compensates the initial readout with low-level information.
Figure 4: Qualitative comparison with XMem, AOT, and STM on representative challenging cases.
Figure 5: Visualization of the memory readout features of AOT, our method's initial readout and the final readout (i.e., feature after context embedding).
