The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning
Shentong Mo
TL;DR
CMT-MAE improves masked autoencoder pre-training by introducing collaborative masking and collaborative targets that fuse teacher and student knowledge. A two-stage process aggregates teacher attention $\mathbf{A}^t$ and student momentum attention $\mathbf{A}^s$ into a collaborative map $\mathbf{A}^c$ that guides masking, while the reconstruction targets combine teacher features $\mathbf{f}_i^t$ and student features $\mathbf{f}_i^s$ through a weighted loss controlled by $\alpha$. This teacher-student collaboration yields state-of-the-art linear probing and fine-tuning results on ImageNet-1K classification, along with strong gains on the ADE20K, DAVIS, and COCO downstream tasks. The approach demonstrates that a simple integration of self-training feedback into masked image modeling (MIM) can significantly improve representation learning with transformers.
Abstract
Masked autoencoders (MAE) have recently succeeded in self-supervised vision representation learning. Previous work mainly applied custom-designed (e.g., random, block-wise) masking or teacher (e.g., CLIP)-guided masking and targets. However, these approaches ignore the potential role of the self-training (student) model in giving feedback to the teacher for masking and targets. In this work, we propose integrating Collaborative Masking and Targets to boost Masked AutoEncoders, a framework we call CMT-MAE. Specifically, CMT-MAE uses a simple collaborative masking mechanism that linearly aggregates attention from both the teacher and student models. We further propose using the output features of the two models as the collaborative target of the decoder. Our simple and effective framework, pre-trained on ImageNet-1K, achieves state-of-the-art linear probing and fine-tuning performance. In particular, using ViT-base, we improve the fine-tuning accuracy of the vanilla MAE from 83.6% to 85.7%.
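
To make the two collaborative components concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes per-patch attention scores are given as flat lists, that the linear aggregation uses a single mixing weight (called `beta` here, a hypothetical name), that the highest-scoring patches are the ones masked, and that the weighted loss takes the form $\alpha \cdot \mathrm{MSE}(\text{pred}, \mathbf{f}_i^t) + (1-\alpha) \cdot \mathrm{MSE}(\text{pred}, \mathbf{f}_i^s)$ averaged over masked patches. None of these choices, nor the default values, are confirmed by the text above.

```python
def collaborative_attention(attn_teacher, attn_student, beta=0.5):
    """Linearly blend per-patch attention scores from teacher and student.

    `beta` is a hypothetical mixing weight (assumption, not from the paper).
    """
    if len(attn_teacher) != len(attn_student):
        raise ValueError("attention maps must cover the same patches")
    return [beta * t + (1.0 - beta) * s
            for t, s in zip(attn_teacher, attn_student)]


def select_masked_patches(attn_collab, mask_ratio=0.75):
    """Return sorted indices of the patches to mask.

    Assumption: the patches with the highest collaborative attention
    are masked; the source does not state the selection rule.
    """
    num_masked = int(len(attn_collab) * mask_ratio)
    ranked = sorted(range(len(attn_collab)),
                    key=lambda i: attn_collab[i], reverse=True)
    return sorted(ranked[:num_masked])


def mse(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def collaborative_loss(preds, teacher_feats, student_feats, masked, alpha=0.5):
    """Weighted reconstruction loss over masked patches.

    Assumed form: alpha * MSE(pred, teacher) + (1 - alpha) * MSE(pred, student).
    """
    if not masked:
        raise ValueError("at least one masked patch is required")
    total = 0.0
    for i in masked:
        total += (alpha * mse(preds[i], teacher_feats[i])
                  + (1.0 - alpha) * mse(preds[i], student_feats[i]))
    return total / len(masked)


if __name__ == "__main__":
    # Toy example with four patches and 2-dimensional features.
    a_t = [0.1, 0.5, 0.2, 0.2]   # teacher attention per patch
    a_s = [0.1, 0.3, 0.2, 0.4]   # student (momentum) attention per patch
    a_c = collaborative_attention(a_t, a_s, beta=0.5)
    masked = select_masked_patches(a_c, mask_ratio=0.5)

    preds = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
    f_t = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
    f_s = [[0.0, 0.0], [3.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
    print(masked, collaborative_loss(preds, f_t, f_s, masked, alpha=0.5))
```

In this sketch, setting `alpha=1.0` reduces the target to the teacher features alone, and `beta=1.0` reduces the masking to purely teacher-guided masking, which is the setting the abstract contrasts CMT-MAE against.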
