Efficient Video Object Segmentation via Modulated Cross-Attention Memory

Abdelrahman Shaker; Syed Talal Wasim; Martin Danelljan; Salman Khan; Ming-Hsuan Yang; Fahad Shahbaz Khan

TL;DR

A transformer-based approach that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion on long videos.

Abstract

Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks (LVOS, Long-Time Video, and DAVIS 2017) demonstrate the effectiveness of our proposed contributions, leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x while reducing GPU memory consumption by 87%, with comparable segmentation performance on short and long video datasets. Notably, on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
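
To illustrate why periodic memory expansion becomes a problem on long videos, the following is a minimal sketch that counts stored memory frames under two policies: a bank that appends one frame every few frames, and a memory with a fixed number of slots. The function names, the `expand_every` interval, and the `slots` value are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
def expanding_bank_sizes(num_frames, expand_every):
    """Stored-frame count after each processed frame when the bank
    appends one frame every `expand_every` frames (illustrative policy)."""
    sizes = []
    stored = 1  # assumption: the bank starts with the annotated reference frame
    for t in range(1, num_frames + 1):
        if t % expand_every == 0:
            stored += 1
        sizes.append(stored)
    return sizes


def fixed_memory_sizes(num_frames, slots=2):
    """Stored-frame count for a memory that is updated in place
    and never grows (hypothetical slot count)."""
    return [slots] * num_frames


if __name__ == "__main__":
    for n in (100, 1000, 3500):
        growing = expanding_bank_sizes(n, expand_every=5)[-1]
        constant = fixed_memory_sizes(n)[-1]
        print(f"{n} frames: expanding bank = {growing}, fixed memory = {constant}")
```

Under the expanding policy the stored-frame count grows linearly with video length, so the cost of attending to the bank grows with it; a fixed-size memory keeps the per-frame cost constant, which is the property the abstract attributes to the MCA memory.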

Paper Structure (20 sections, 6 equations, 8 figures, 8 tables)

This paper contains 20 sections, 6 equations, 8 figures, 8 tables.

Introduction
Related Work
Method
MAVOS Architecture
E-LSTT Block
Modulated Cross-Attention (MCA) Memory
Experiments
Setup
Datasets
Results on LVOS
Results on LTV
Results on DAVIS 2017
Ablations
Conclusion
Acknowledgment
...and 5 more sections

Figures (8)

Figure 1: Left: Comparison of our proposed MAVOS with recent transformer-based methods using the same backbone in terms of speed (FPS) and mean frames per video, along with GPU memory consumption (in GB) and VOS performance on the graph. Recent transformer-based approaches exhibit a substantial reduction in speed and a memory explosion on long videos, while MAVOS models maintain consistent speed without GPU memory issues and with no significant performance degradation on both short-video (DAVIS [Pont-Tuset_arXiv_2017]) and long-video datasets (LVOS [voigtlaender2019feelvos], LTV [liang2020afb-ubr]). FPS is measured using a V100 GPU. Right: MAVOS results on long videos from the LVOS (top) and LTV (bottom) datasets, showcasing robust performance on videos longer than 120 seconds for LVOS and longer than 3500 frames for LTV. Additional results are presented in the supplementary material.
Figure 2: Overview of the proposed MAVOS. (a) An illustration of the overall architecture. The video frames are passed to the lightweight encoder to extract the frame features, followed by the proposed E-LSTT block to handle the long-term memory efficiently, followed by the decoder to generate the masks. (b) The details of the E-LSTT block. It mainly consists of short-term propagation, long-term propagation, and self-propagation, which propagate the target information from the previous frames. The long-term propagation is based on the proposed Modulated Cross-Attention (MCA) memory. (c) Our proposed MCA memory. The new memory is projected to queries, and the memory context is projected to keys and values. We apply hierarchical contextualization using depth-wise convolution to generate the local context and multiply it by learnable gates. The aggregated context is projected and multiplied by the attention maps to generate the output.
Figure 3: Interpretation of the MCA memory. Interpreting the dynamic frame of the MCA memory reveals its ability to encode temporal smoothness over time along the boundary of the black swan.
Figure 4: Qualitative comparison between MAVOS and the SOTA methods on LVOS val set with the same backbone. While the masks of AOT-L and DeAOT-L with infinite memory banks are promising, their real-time performance is hindered by a low FPS. On the other hand, XMem exhibits good FPS but struggles in challenging scenarios (marked in red dashed box). In the third row, XMem fails to recover the person occluded by the walls in the past few frames. In the fourth row, it confuses the person with the skateboard due to another disappearance. In contrast, our MAVOS accurately segments the targets despite the absence and occlusion with real-time FPS.
Figure 5: Qualitative comparison between MAVOS and baseline DeAOT-L on LVOS val set. Left: With two memory frames, the baseline struggles to correctly segment the target when it disappears at frame 891 and reappears at frame 1581. Even with six memory frames, the baseline confuses the target (small zebra) with what is potentially its mother (middle zebra) at frame 1581. Right: At frames 289 and 748, the baseline with two memory frames confuses both kites due to occlusion, and with six memory frames it over-segments the tails. In contrast, MAVOS shows impressive performance, accurately delineating targets despite the absence and occlusion.
...and 3 more figures
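
The data flow described in the Figure 2(c) caption can be sketched roughly as follows. This is only one possible reading of the caption, not the authors' implementation: tokens are flattened to a 1D sequence, the gates are assumed to be a sigmoid of a linear projection of the memory context, ReLU is an assumed nonlinearity, "multiplied by the attention maps" is read as a matrix product, and all names (`mca_sketch`, `init_params`, `Wq`, `Wk`, `Wv`, `Wg`, `Wo`, the kernel sizes) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depthwise_conv1d(x, kernel):
    """Filter each channel of x (N, C) with its own column of kernel (K, C),
    K odd, using zero 'same' padding along the token axis."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[i:i + k] * kernel).sum(axis=0) for i in range(x.shape[0])])


def init_params(dim, kernel_sizes=(3, 5), seed=0):
    """Random stand-ins for learned weights (hypothetical layout)."""
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.standard_normal(shape) / np.sqrt(shape[0])

    return {
        "Wq": w(dim, dim), "Wk": w(dim, dim), "Wv": w(dim, dim),
        "Wg": w(dim, len(kernel_sizes)),  # one gate per context level
        "Wo": w(dim, dim),                # projection of the aggregated context
        "kernels": [w(k, dim) for k in kernel_sizes],
    }


def mca_sketch(new_memory, memory_context, p):
    # New memory -> queries; memory context -> keys and values.
    q = new_memory @ p["Wq"]
    k = memory_context @ p["Wk"]
    v = memory_context @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Nq, Nk), rows sum to 1

    # Hierarchical contextualization: stacked depth-wise convolutions,
    # each level weighted by its gate and summed (assumed aggregation).
    gates = sigmoid(memory_context @ p["Wg"])        # (Nk, levels)
    ctx, aggregated = v, np.zeros_like(v)
    for level, kernel in enumerate(p["kernels"]):
        ctx = np.maximum(depthwise_conv1d(ctx, kernel), 0.0)
        aggregated += ctx * gates[:, level:level + 1]

    # Project the aggregated context and combine it with the attention maps.
    return attn @ (aggregated @ p["Wo"])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    params = init_params(dim=8)
    out = mca_sketch(rng.standard_normal((6, 8)), rng.standard_normal((10, 8)), params)
    print(out.shape)
```

The output has one row per query token, so its size depends on the new memory rather than on how many frames have been processed, which is consistent with the constant-cost behavior the captions describe.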
