DropMAE: Learning Representations via Masked Autoencoders with Spatial-Attention Dropout for Temporal Matching Tasks

Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Wei Lin, Baoyuan Wu, Antoni B. Chan

TL;DR

DropMAE advances video-based self-supervised pre-training by introducing adaptive spatial-attention dropout (ASAD), which forces masked autoencoders to learn temporal cues. The method yields a general, efficient backbone that improves temporal matching across VOT, VOS, optical flow estimation, long-term point tracking, 3D point cloud tracking, and self-supervised correspondence learning, with 2x faster pre-training than the ImageNet-based MAE. Experiments on 13 benchmarks demonstrate broad transfer to diverse tracking tasks, outperforming ImageNet-based MAE and competing with task-specific approaches. The work highlights the practical impact of temporal priors in pre-training and paves the way for broader adoption of temporally aware ViT backbones in tracking systems.

Abstract

This paper studies masked autoencoder (MAE) video pre-training for a range of temporal matching-based downstream tasks: object-level tracking tasks, including video object tracking (VOT) and video object segmentation (VOS); self-supervised visual correspondence learning; dense tracking tasks, including optical flow estimation and long-term point tracking; and 3D point cloud tracking. Specifically, our work aims to provide a general representation that boosts temporal matching ability across these downstream tracking tasks. We first find that a simple extension of MAE, which randomly masks out frame patches in videos and reconstructs the frame pixels, relies heavily on spatial cues while ignoring temporal relations for frame reconstruction, leading to sub-optimal temporal matching representations. To alleviate this, we propose DropMAE, which adaptively performs spatial-attention dropout during frame reconstruction to facilitate temporal correspondence learning in videos. We obtain several important findings with DropMAE: 1) DropMAE is a strong and efficient temporal matching learner, achieving better fine-tuning results on matching-based tasks than the ImageNet-based MAE with 2x faster pre-training. 2) DropMAE is effective for different tracking tasks: object-level matching tasks, including VOT and VOS; dense tracking tasks, including optical flow estimation and tracking any point (TAP); and even 3D tracking in the different modality of point cloud data. Since none exist, we build ViT-based trackers for the different downstream tracking tasks, and our pre-trained DropMAE model can be loaded directly into these ViT-based trackers for fine-tuning without further modification. Experiments on 6 downstream tracking tasks demonstrate the effectiveness of DropMAE as a general pre-trained representation for diverse tracking tasks.
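The idea of spatial-attention dropout can be illustrated with a minimal sketch. It is not the authors' implementation: it drops within-frame attention logits uniformly at random, whereas the abstract describes an adaptive scheme. The function names, the flat two-frame token layout, the `drop_ratio` parameter, and the choice to never drop the query's own position are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; logits set to -inf receive zero weight."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_dropout(logits, key_frames, query_index, drop_ratio, rng):
    """Mask a random subset of within-frame logits for a single query token.

    logits      : pre-softmax attention logits of the query against every key
    key_frames  : frame id of each key token (hypothetical flat layout)
    query_index : position of the query itself; never dropped (an assumption)
    drop_ratio  : fraction of the other within-frame keys to mask
    """
    query_frame = key_frames[query_index]
    candidates = [
        j for j, f in enumerate(key_frames)
        if f == query_frame and j != query_index
    ]
    num_drop = int(drop_ratio * len(candidates))
    dropped = set(rng.sample(candidates, num_drop))
    # Masking before the softmax renormalizes the remaining attention weights.
    masked = [-math.inf if j in dropped else x for j, x in enumerate(logits)]
    return softmax(masked), dropped


# Two frames of four tokens each; uniform logits keep the arithmetic simple.
key_frames = [0, 0, 0, 0, 1, 1, 1, 1]
logits = [0.0] * len(key_frames)
weights, dropped = spatial_attention_dropout(
    logits, key_frames, query_index=0, drop_ratio=0.5, rng=random.Random(0)
)
between_frame_mass = sum(w for w, f in zip(weights, key_frames) if f == 1)
print(sorted(dropped), round(between_frame_mass, 3))
```

Because the mask is applied before the softmax, the attention mass removed from within-frame keys is redistributed over the remaining keys, which raises the share assigned to the other frame.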

Paper Structure (32 sections, 6 equations, 14 figures, 15 tables, 1 algorithm)

Figures (14)

  • Figure 1: A general DropMAE pre-trained model for various downstream tracking tasks including object-level tracking (i.e., VOT and VOS), 3D point cloud tracking, dense tracking (i.e., optical flow estimation and long-term point tracking) and self-supervised correspondence learning for unsupervised tracking.
  • Figure 2: Visualization of the attention maps of the TwinMAE baseline and our DropMAE in the reconstruction of a random masked patch, which is denoted as a red bounding box in the left input frame. TwinMAE leverages the spatial cues (within the same frame) more than temporal cues (between frames) for reconstruction. Our proposed DropMAE improves the baseline by effectively alleviating co-adaptation between spatial cues in the reconstruction, focusing more on temporal cues, thus achieving better learning of temporal correspondences for tracking tasks.
  • Figure 3: An illustration of our DropMAE. The proposed adaptive spatial-attention dropout (ASAD) facilitates temporal correspondence learning for temporal matching tasks. TwinMAE follows the same pipeline except that the ASAD module is not used.
  • Figure 4: The average within-frame and between-frame attention scores obtained by TwinMAE and DropMAE in different decoder layers. The attention score is calculated on 20 randomly sampled K400 validation videos, and is averaged on all heads and locations.
  • Figure 5: Visualization of the temporal matching function $f_{tem}$ on an example frame pair. A large value of $f_{tem}(i)$ indicates that the $i$-th pixel matches well to a pixel in the other frame.
  • ...and 9 more figures
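
The quantities named in the captions of Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the "attention score" is taken here to be the attention mass a query places on same-frame versus other-frame keys, averaged over queries, and the per-query maximum cross-frame attention is only a plausible proxy for $f_{tem}$, whose exact definition is not given in this excerpt. The function names and the toy attention matrix are hypothetical.

```python
def attention_split(attn, frame_ids):
    """Mean attention mass per query on same-frame and other-frame keys.

    attn      : row-stochastic attention matrix (one row per query token)
    frame_ids : frame id of each token, shared by queries and keys
    """
    within = 0.0
    between = 0.0
    for i, row in enumerate(attn):
        for j, w in enumerate(row):
            if frame_ids[j] == frame_ids[i]:
                within += w
            else:
                between += w
    n = len(attn)
    return within / n, between / n


def max_cross_frame_attention(attn, frame_ids):
    """Per-query maximum attention on the other frame (assumed proxy for f_tem)."""
    return [
        max(w for j, w in enumerate(row) if frame_ids[j] != frame_ids[i])
        for i, row in enumerate(attn)
    ]


# Toy example: two tokens per frame, each row sums to 1.
frame_ids = [0, 0, 1, 1]
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.60, 0.20, 0.10],
    [0.00, 0.00, 0.50, 0.50],
]
within, between = attention_split(attn, frame_ids)
matching = max_cross_frame_attention(attn, frame_ids)
print(within, between, matching)
```

In this toy matrix the third token places most of its attention on a token in the other frame, so its proxy value is the largest, which matches the caption's reading that a large value indicates a good cross-frame match.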