Mixed Autoencoder for Self-supervised Visual Representation Learning

Kai Chen; Zhili Liu; Lanqing Hong; Hang Xu; Zhenguo Li; Dit-Yan Yeung

TL;DR

This work investigates data augmentation for masked image modeling (MIM) and finds that naive mixing increases mutual information, which paradoxically can ease reconstruction but degrade transfer quality. To address this, it introduces MixedAE, a pure autoencoder framework that combines image mixing with homologous recognition: homologous attention in place of standard self-attention, segment embeddings, and a dual loss that couples reconstruction with a homologous contrastive term. Empirically, MixedAE achieves state-of-the-art transfer performance among MIM augmentations on ImageNet-1K, ADE20K, and COCO, and it surpasses the strong MIM/SSL baseline iBOT while training 2x faster. The approach yields more object-aware pre-training, which improves dense perception tasks and suggests that mixing can be an effective augmentation for MIM when paired with a suitable pretext task. Overall, MixedAE offers a practical path to stronger visual representations with reduced pre-training overhead. The authors state that code will be released.

Abstract

Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks by randomly masking image patches and reconstructing them. However, effective data augmentation strategies for MAE remain an open question, unlike in contrastive learning, where augmentation serves as the most important part. This paper studies the prevailing mixing augmentation for MAE. We first demonstrate that naive mixing instead degrades model performance because it increases mutual information (MI). To address this, we propose homologous recognition, an auxiliary pretext task, not only to alleviate the MI increase by explicitly requiring each patch to recognize homologous patches, but also to perform object-aware self-supervised pre-training for better downstream dense perception performance. With extensive experiments, we demonstrate that our proposed Mixed Autoencoder (MixedAE) achieves state-of-the-art transfer results among masked image modeling (MIM) augmentations on different downstream tasks with significant efficiency. Specifically, our MixedAE outperforms MAE by +0.3% accuracy, +1.7 mIoU and +0.9 AP on ImageNet-1K, ADE20K and COCO respectively with a standard ViT-Base. Moreover, MixedAE surpasses iBOT, a strong MIM method combined with instance discrimination, while accelerating training by 2x. To the best of our knowledge, this is the very first work to consider mixing for MIM from the perspective of pretext task design. Code will be made available.
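
The homologous recognition idea can be illustrated with a small, hedged sketch in plain Python. It assumes, following the Figure 2 caption, that each patch attends only to the k patches with the highest attention scores. The function names `topk_attention` and `sampling_accuracy`, the parameter `k`, and the toy 2-D features are illustrative assumptions, not the authors' implementation.

```python
import math


def topk_attention(queries, keys, values, k):
    """For each query, apply softmax attention over only its k best-scoring keys.

    Hypothetical helper: a toy stand-in for attention restricted to the
    patches with the highest attention mass.
    """
    dim = len(queries[0])
    outputs, kept = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        # Indices of the k highest-scoring keys.
        top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Numerically stable softmax restricted to the kept keys.
        peak = max(scores[j] for j in top)
        weights = {j: math.exp(scores[j] - peak) for j in top}
        total = sum(weights.values())
        out = [
            sum(weights[j] / total * values[j][d] for j in top)
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        kept.append(sorted(top))
    return outputs, kept


def sampling_accuracy(kept, segments):
    """Fraction of kept keys that come from the same source image as the query."""
    hits = sum(segments[j] == segments[i] for i, row in enumerate(kept) for j in row)
    total = sum(len(row) for row in kept)
    return hits / total


# Toy mixed sample: patches 0-1 come from image 0, patches 2-3 from image 1.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = [0, 0, 1, 1]
_, kept = topk_attention(features, features, features, k=2)
print(kept)                               # [[0, 1], [0, 1], [2, 3], [2, 3]]
print(sampling_accuracy(kept, segments))  # 1.0
```

In this toy case the features of the two source images are well separated, so every patch keeps only homologous patches. In a real mixed sample the separation has to be learned, which is the role the caption assigns to the homologous contrastive loss.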

Paper Structure (62 sections, 12 equations, 7 figures, 7 tables)

Introduction
Related Work
Reconstruction target.
Masking strategy.
Input augmentation.
Method
Mixing: A Simple Baseline
Mixing.
Unmixing.
Mutual information analysis.
Recognition: Homologous Recognition
Homologous attention.
Homologous contrastive.
Segment embedding.
Mixing mode.
...and 47 more sections

Figures (7)

Figure 1: Fine-tuning accuracy on ImageNet-1K. Our MixedAE achieves the best trade-off between pre-training overhead and transfer performance. Specifically, MixedAE consistently surpasses MAE \cite{MAE} with only 3% extra overhead, and outperforms the strong iBOT \cite{ibot} with only 53.4% of its computation overhead. See more detailed comparisons in \ref{tab:main_transfer}. ID stands for instance discrimination, while MIM represents masked image modeling.
Figure 2: Model architecture of Mixed Autoencoder (MixedAE). (a) The input images are first separated into groups to generate mixed samples independently, which are then fed to the encoder for feature extraction. (b) The self-attention operations are replaced with our homologous attention, forcing each patch to attend only to the patches with the highest attention mass. (c) The encoder features are "unmixed" and fed into the decoder for pixel reconstruction. (d) Meanwhile, the homologous contrastive loss is adopted to verify the sampling accuracy by encouraging features of homologous patches to be similar and features of heterologous patches to be dissimilar.
Figure 3: Visualization of segment embeddings. (a) Due to the uncertainty of generative modeling, green colors of the cucumber and the forest are both reasonable for patches in the red ellipse. (b) We adopt different segment embeddings for different images to provide necessary information for homologous recognition.
Figure 4: Visualization of two mixing modes when $r=0.5$. (a) In the compose mixing mode, each group generates a single mixed sample, (b) while in the full mixing mode, $1/r$ mixed samples are generated per group to keep the effective batch size unchanged.
Figure 5: Visualizations of attention maps on images from ImageNet-1K \cite{deng2009imagenet} (1st-3rd columns), Microsoft COCO \cite{lin2014microsoft} (4th-6th columns) and ADE20K \cite{ADE20K} (7th-9th columns). Both MAE and MixedAE are pre-trained on ImageNet-1K for 300 epochs. Compared with MAE, which mainly focuses on the most discriminative patches (e.g., boundaries (1st, 2nd & 5th) and edges (6th & 8th)), MixedAE discovers foreground object patches more precisely (3rd & 9th) and completely (4th & 7th). See more attention maps in \ref{app:visualization}.
...and 2 more figures
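
The two mixing modes in the Figure 4 caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: images are lists of patch labels, mixing is done at the patch level with each image in a group contributing an equal share of patches, and the full mode is realized by cyclically shifting one random assignment so that every patch of every image is used exactly once. The function names (`sample_assignment`, `compose_mix`, `full_mix`) and the cyclic-shift scheme are hypothetical and may differ from the paper's actual procedure.

```python
import random


def sample_assignment(num_patches, group_size, rng):
    """Assign each patch position to one image of the group, in equal shares."""
    assert num_patches % group_size == 0
    owners = [i % group_size for i in range(num_patches)]
    rng.shuffle(owners)
    return owners


def compose_mix(group, rng):
    """Compose mode: one mixed sample per group of 1/r images."""
    owners = sample_assignment(len(group[0]), len(group), rng)
    mixed = [group[owner][pos] for pos, owner in enumerate(owners)]
    return [mixed], [owners]


def full_mix(group, rng):
    """Full mode (assumed variant): 1/r mixed samples per group.

    One random assignment is cyclically shifted, so the number of mixed
    samples equals the number of input images in the group.
    """
    size = len(group)
    base = sample_assignment(len(group[0]), size, rng)
    samples, all_owners = [], []
    for shift in range(size):
        owners = [(owner + shift) % size for owner in base]
        samples.append([group[owner][pos] for pos, owner in enumerate(owners)])
        all_owners.append(owners)
    return samples, all_owners


# r = 0.5, so each group holds 1/r = 2 images, here with 4 patches each.
rng = random.Random(0)
group = [[f"a{p}" for p in range(4)], [f"b{p}" for p in range(4)]]
compose_samples, _ = compose_mix(group, rng)
full_samples, _ = full_mix(group, rng)
print(len(compose_samples), len(full_samples))  # prints "1 2"
```

The owner lists returned alongside the mixed samples record which source image each patch came from, which is the kind of information a segment embedding or an unmixing step would need.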
