Rethinking Patch Dependence for Masked Autoencoders

Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg

TL;DR

This work challenges the necessity of patch-to-patch interactions in MAE by showing that the encoder learns a global representation sufficient for coherent masked reconstruction. It introduces CrossMAE, which uses a cross-attention decoder to read out reconstructions from encoder outputs, enabling independent decoding of masked patches and partial reconstruction for efficiency. Across ViT-S to ViT-H, CrossMAE achieves comparable or better downstream performance than MAE on ImageNet-1K and COCO while significantly reducing decoder FLOPS and memory usage. The findings highlight the encoder’s role in global context learning and propose a scalable, efficient masked pretraining paradigm with potential for large-scale visual learning.

Abstract

In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
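The independence claim in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head and random stand-ins for the encoder outputs, and the names `cross_attention_decode`, `Wq`, `Wk`, `Wv`, and the 196-patch grid are assumptions made for illustration. Because each mask-token query attends only to visible tokens and never to other mask tokens, decoding a subset of the masked patches gives the same outputs for that subset as decoding all of them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_decode(queries, visible, Wq, Wk, Wv):
    """Single-head cross-attention (illustrative helper, not from the paper's code).

    Each query row attends only to the visible tokens, so the rows of the
    output are computed independently of one another.
    """
    q = queries @ Wq                      # (num_queries, d)
    k = visible @ Wk                      # (num_visible, d)
    v = visible @ Wv                      # (num_visible, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v   # (num_queries, d)

rng = np.random.default_rng(0)
d = 16
num_patches = 196                         # assumed 14 x 14 patch grid
num_visible = int(num_patches * 0.25)     # 75% mask ratio leaves 49 visible patches
perm = rng.permutation(num_patches)
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

pos_embed = rng.normal(size=(num_patches, d))
mask_token = rng.normal(size=(d,))
visible_feats = rng.normal(size=(num_visible, d))   # stand-in for encoder outputs
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Queries are the shared mask token plus each masked patch's positional embedding.
all_queries = mask_token + pos_embed[masked_idx]    # (147, d)
full = cross_attention_decode(all_queries, visible_feats, Wq, Wk, Wv)

# Partial reconstruction: decode only every third masked patch
# (1/3 of the mask tokens, i.e. 25% of all image tokens).
subset = np.arange(0, len(masked_idx), 3)
partial = cross_attention_decode(all_queries[subset], visible_feats, Wq, Wk, Wv)

# The subset's outputs match the corresponding rows of the full decode.
assert np.allclose(partial, full[subset])
```

In a decoder with self-attention over mask tokens the final check would generally fail, since each output would depend on which other mask tokens were present.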

Paper Structure (33 sections, 2 equations, 7 figures, 18 tables)

Figures (7)

  • Figure 1: Method Overview. (A) A masked autoencoder (MAE) starts by masking random patches of the input image. (B) To reconstruct a mask token (marked by the blue star), MAE attends to both the mask tokens (B.Left) and the visible tokens (B.Right). A quantitative comparison over the ImageNet validation set shows that the mask tokens in MAE attend disproportionately to the visible tokens (1.42 vs. 0.39), which calls into question the necessity of attention among mask tokens. (C) We propose CrossMAE, in which the masked patches are reconstructed using only cross-attention between the mask tokens and the visible tokens. Surprisingly, CrossMAE matches or outperforms MAE on ImageNet classification and COCO instance segmentation.
  • Figure 2: Example reconstructions of ImageNet validation images. Each set of 5 images shows, from left to right, the original image, the masked image with a mask ratio of 75%, MAE (He et al., 2021), CrossMAE trained to reconstruct 25% of image tokens (1/3 of the mask tokens), and CrossMAE trained to reconstruct all mask tokens. Because CrossMAE does not reconstruct the visible patches, all model outputs have the visible patches overlaid. Intriguingly, CrossMAE trained for partial reconstruction can still decode all mask tokens in one forward pass (shown above), indicating that the encoder, rather than the decoder, captures global image information in its output tokens. Its reconstruction quality is comparable to that of models trained on full-image reconstruction, suggesting that full-image reconstruction might not be essential for effective representation learning.
  • Figure 3: MAE (He et al., 2021) concatenates all mask tokens with the visible patch features from a ViT encoder and passes them to a decoder with self-attention blocks to reconstruct the original image. Patches that correspond to visible tokens are then dropped, and an L2 loss is applied to the remaining reconstructed patches as the pretraining objective. CrossMAE instead uses cross-attention blocks in the decoder to reconstruct only a subset of the masked tokens.
  • Figure 4: Overview of CrossMAE. (a) The vanilla version of CrossMAE uses the output of the last encoder block as the keys and values for cross-attention. The first decoder block takes the sum of mask tokens and their corresponding positional embeddings as queries, and subsequent blocks use the output of the previous decoder block as queries to reconstruct the masked patches. (b) Unlike the decoder block in Vaswani et al. (2017), the cross-attention decoder block does not contain self-attention, which decouples the generation of different masked patches. (c) CrossMAE's decoder blocks can leverage low-level features for reconstruction via inter-block attention, which weights the intermediate encoder feature maps and uses their weighted sum as the keys and values for each decoder block.
  • Figure 5: We visualize the output of each decoder block. (a-b) Different decoder blocks play different roles in the reconstruction, with most details emerging at later decoder blocks, which confirms the motivation for inter-block attention. (c) Visualizations of inter-block attention show that different decoder blocks indeed attend to features from different encoder blocks, with later blocks focusing on earlier encoder features to achieve reconstruction. The reconstructions are unnormalized w.r.t. the ground-truth mean and std of each patch.
  • ...and 2 more figures
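
The inter-block attention described in the Figure 4(c) caption can be sketched as a per-decoder-block weighted sum over encoder feature maps. The snippet below is a minimal illustration under stated assumptions, not the released implementation: the function name `inter_block_features`, the tensor shapes, and the use of unnormalized random weights are all hypothetical, and the caption does not specify how the weights are initialized or normalized.

```python
import numpy as np

def inter_block_features(feature_maps, weights):
    """Weighted sum of encoder feature maps, one mix per decoder block.

    feature_maps: (num_maps, num_visible, d) intermediate encoder outputs
    weights:      (num_decoder_blocks, num_maps) mixing weights (learned in practice)
    returns:      (num_decoder_blocks, num_visible, d) keys/values source per block
    """
    return np.einsum("bm,mnd->bnd", weights, feature_maps)

rng = np.random.default_rng(0)
num_maps, num_visible, d, num_decoder_blocks = 12, 49, 16, 4   # assumed sizes

feature_maps = rng.normal(size=(num_maps, num_visible, d))     # stand-in encoder features
weights = rng.normal(size=(num_decoder_blocks, num_maps))      # stand-in for learned weights

# Each decoder block receives its own mixture of encoder features.
kv_per_block = inter_block_features(feature_maps, weights)

# A one-hot weight on the last feature map recovers the vanilla variant in
# Figure 4(a), where only the final encoder block's output is used.
vanilla_weights = np.zeros((1, num_maps))
vanilla_weights[0, -1] = 1.0
vanilla_kv = inter_block_features(feature_maps, vanilla_weights)[0]
assert np.allclose(vanilla_kv, feature_maps[-1])
```

Treating the vanilla variant as a special case of the weighted sum makes it clear that inter-block attention only changes which encoder features each decoder block reads from, not the cross-attention operation itself.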