The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

Shentong Mo

TL;DR

CMT-MAE improves masked autoencoder pre-training by introducing collaborative masking and collaborative targets that fuse teacher and student knowledge. A two-stage process aggregates teacher attention $\mathbf{A}^t$ and student momentum attention $\mathbf{A}^s$ into a collaborative map $\mathbf{A}^c$ to guide masking, while reconstruction targets combine teacher features $\mathbf{f}_i^t$ and student features $\mathbf{f}_i^s$ via a weighted loss controlled by the collaborative ratio $\alpha$. This teacher-student collaboration yields state-of-the-art results on ImageNet-1K classification and strong gains on ADE20K, DAVIS, and COCO across linear probing, fine-tuning, and downstream tasks. The approach demonstrates that a simple integration of self-training feedback into masked image modeling (MIM) can significantly improve representation learning with transformers.

Abstract

Masked autoencoders (MAE) have recently succeeded in self-supervised vision representation learning. Previous work mainly applied custom-designed (e.g., random, block-wise) masking or teacher (e.g., CLIP)-guided masking and targets. However, these approaches ignore the potential role of the self-training (student) model in giving feedback to the teacher for masking and targets. In this work, we propose integrating Collaborative Masking and Targets to boost Masked AutoEncoders, namely CMT-MAE. Specifically, CMT-MAE uses a simple collaborative masking mechanism that linearly aggregates attention maps from both the teacher and student models. We further propose using the output features of both models as the collaborative target of the decoder. Our simple and effective framework, pre-trained on ImageNet-1K, achieves state-of-the-art linear probing and fine-tuning performance. In particular, using ViT-base, we improve the fine-tuning results of the vanilla MAE from 83.6% to 85.7%.
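
The linear aggregation of teacher and student attention described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: the aggregation is taken to be a convex combination in which $\alpha$ weights the teacher map, and the highest-scoring patches are taken to be the ones masked. The function names, the flat per-patch lists, and the sample values are hypothetical and chosen only for illustration.

```python
def collaborative_attention(teacher_attn, student_attn, alpha):
    """Blend two per-patch attention maps.

    Assumption: a convex combination with alpha on the teacher side.
    The source only says the maps are linearly aggregated.
    """
    if len(teacher_attn) != len(student_attn):
        raise ValueError("attention maps must cover the same patches")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_attn, student_attn)]


def select_masked_patches(scores, mask_ratio):
    """Return sorted indices of the patches to mask.

    Assumption: the highest-scoring patches are masked. The source does
    not state whether high or low attention drives the mask.
    """
    num_masked = int(len(scores) * mask_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_masked])


# Toy attention over four patches (illustrative values only).
teacher = [0.1, 0.6, 0.2, 0.1]
student = [0.2, 0.2, 0.5, 0.1]
collab = collaborative_attention(teacher, student, alpha=0.5)
masked = select_masked_patches(collab, mask_ratio=0.5)
```

With $\alpha = 1$ this sketch reduces to purely teacher-guided masking, which matches the role the teacher plays alone in the first stage.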

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of our CMT-MAE with MAE and DINO on pre-trained ViT-B/16. Our method significantly outperforms previous baselines on all downstream tasks.
  • Figure 2: Illustration of the proposed Masked Autoencoder with Collaborative Masking and Targets (CMT-MAE) framework. First stage: a teacher transformer encoder (i.e., CLIP) takes an input image and extracts an attention map $\mathbf{A}^t$ from its last attention layer to guide masking. The student encoder generates features from the unmasked patches, which are concatenated with masked tokens and fed into a decoder that recovers the teacher features $\mathbf{f}_i^t$ of the masked patches. Second stage: a student momentum encoder takes the input image and generates a student-guided attention map $\mathbf{A}^s$, which is linearly aggregated with the teacher-guided attention map $\mathbf{A}^t$ using a collaborative ratio $\alpha$ to produce the collaborative attention map $\mathbf{A}^c$ for collaborative masking. The masked tokens are then concatenated with the student encoder's features of the unmasked patches and fed into the decoder. Finally, two linear prediction heads reconstruct the teacher features $\mathbf{f}_i^t$ and student features $\mathbf{f}_i^s$ of the masked patches as collaborative targets. Note that the collaborative ratio $\alpha$ is also used to weight the collaborative losses from the teacher and student targets.
  • Figure 3: Visualizations of DAVIS 2017 video object segmentation using ViT-B/16 pre-trained on ImageNet-1K. The four rows for each case show raw frames, ground-truth masks, MAE predictions, and our CMT-MAE predictions. The proposed CMT-MAE produces more accurate, higher-quality segmentation masks.
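
The collaborative targets in the Figure 2 caption, where two heads reconstruct teacher and student features and $\alpha$ weights the two losses, might be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the per-patch error is taken to be mean squared error, $\alpha$ is assumed to weight the teacher term with $1 - \alpha$ on the student term, and all function names and sample values are hypothetical.

```python
def mean_squared_error(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    if len(pred) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def head_loss(preds, targets):
    """Average the per-patch error over all masked patches."""
    if not preds or len(preds) != len(targets):
        raise ValueError("need one target per predicted masked patch")
    return sum(mean_squared_error(p, t)
               for p, t in zip(preds, targets)) / len(preds)


def collaborative_loss(teacher_preds, teacher_feats,
                       student_preds, student_feats, alpha):
    """Weighted sum of the two reconstruction losses.

    Assumption: alpha weights the teacher term and (1 - alpha) the
    student term. The source only says alpha is applied to both losses.
    """
    teacher_term = head_loss(teacher_preds, teacher_feats)
    student_term = head_loss(student_preds, student_feats)
    return alpha * teacher_term + (1.0 - alpha) * student_term


# One masked patch with 2-dimensional features (illustrative values only).
loss = collaborative_loss(
    teacher_preds=[[1.0, 2.0]], teacher_feats=[[1.0, 0.0]],
    student_preds=[[0.0, 0.0]], student_feats=[[1.0, 1.0]],
    alpha=0.25,
)
```

Here the teacher term is 2.0 and the student term is 1.0, so the weighted loss is 0.25 * 2.0 + 0.75 * 1.0 = 1.25.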