Rethinking Patch Dependence for Masked Autoencoders
Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg
TL;DR
This work challenges the necessity of patch-to-patch interactions in the MAE decoder by showing that the encoder already learns a global representation sufficient for coherent masked reconstruction. It introduces CrossMAE, which uses a cross-attention decoder to read out reconstructions from encoder outputs, so that each masked patch is decoded independently and only a subset of masked patches needs to be reconstructed. Across ViT-S to ViT-H, CrossMAE achieves comparable or better downstream performance than MAE on ImageNet-1K and COCO while significantly reducing decoder FLOPs and memory usage. The findings highlight the encoder’s role in learning global context and propose a scalable, efficient masked pretraining paradigm with potential for large-scale visual learning.
Abstract
In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
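To illustrate why a cross-attention-only decoder permits partial reconstruction, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes a single attention head with no learned projections, positional embeddings, normalization, or MLP, and the names `cross_attend`, `visible_feats`, and `mask_queries` are hypothetical stand-ins for the decoder, the encoder outputs, and the mask-token queries. Because each query attends only to the visible-token features and never to other queries, decoding a subset of masked patches yields the same values as decoding all of them and selecting that subset.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Single-head cross-attention without learned projections (a simplification).

    Each query (one per masked patch) attends only to `memory`
    (features of visible patches), never to the other queries.
    """
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product scores between this query and every visible token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in memory]
        weights = softmax(scores)
        # Weighted sum of visible-token features.
        out = [sum(w * v[d] for w, v in zip(weights, memory)) for d in range(dim)]
        outputs.append(out)
    return outputs


random.seed(0)
dim, n_visible, n_masked = 8, 4, 12

# Hypothetical stand-ins for encoder outputs and mask-token queries.
visible_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_visible)]
mask_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_masked)]

# Decode every masked patch.
full = cross_attend(mask_queries, visible_feats)

# Decode only a subset of the masked patches.
subset_idx = [1, 5, 7]
partial = cross_attend([mask_queries[i] for i in subset_idx], visible_feats)

# Each partially decoded patch matches its counterpart from the full decode.
same = all(
    math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
    for i, row in zip(subset_idx, partial)
    for a, b in zip(full[i], row)
)
print(same)
```

In a decoder that also applies self-attention among mask tokens, this equivalence would not hold in general, since each output would depend on which other mask tokens are present.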
