MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan

TL;DR

MAGE addresses the lack of a single model that delivers both high-fidelity image generation and strong self-supervised representations. It unifies the two by combining masked image modeling under variable masking ratios with semantic tokenization from a VQGAN, so generation and representation learning share one training framework. On ImageNet-1K, a single ViT-L model reaches 9.10 FID for class-unconditional generation and 78.9% top-1 linear-probing accuracy, and adding a simple contrastive loss (MAGE-C) improves the representations further. Training one model instead of two reduces training and maintenance overhead, and the learned features transfer well to downstream datasets. Future work could scale to larger unlabeled datasets to push performance further.

Abstract

Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
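The variable-masking idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the masking ratio is drawn from a Gaussian truncated to a fixed interval, and the function names (`sample_mask_ratio`, `random_token_mask`) and the default parameter values are hypothetical choices that do not come from the text above. The linked repository is the place to check the actual sampling strategy.

```python
import math
import random


def sample_mask_ratio(rng, mean=0.55, std=0.25, low=0.5, high=1.0):
    """Draw one masking ratio from a Gaussian truncated to [low, high].

    The default values are illustrative assumptions. Rejection sampling
    simply redraws until the sample falls inside the interval.
    """
    while True:
        ratio = rng.gauss(mean, std)
        if low <= ratio <= high:
            return ratio


def random_token_mask(rng, num_tokens, ratio):
    """Return a list of booleans (True = masked) with ceil(ratio * num_tokens) True entries."""
    num_masked = min(num_tokens, math.ceil(ratio * num_tokens))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


rng = random.Random(0)
ratio = sample_mask_ratio(rng)                       # a different ratio per call
mask = random_token_mask(rng, num_tokens=256, ratio=ratio)
print(f"ratio={ratio:.3f}, masked {sum(mask)} of {len(mask)} tokens")
```

Ratios near the upper end of the interval leave almost nothing visible, which resembles generation from scratch, while ratios near the lower end leave enough context for representation learning. Sampling the ratio per example is what lets one training run cover both regimes.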

Paper Structure

This paper contains 17 sections, 3 equations, 13 figures, and 18 tables.

Figures (13)

  • Figure 1: Linear probing and class unconditional generation performance of different methods trained and evaluated on ImageNet-1K. MAGE achieves SOTA performance in linear probing and establishes a new SOTA in class unconditional generation.
  • Figure 2: Reconstruction results using MAE and MAGE with 75% masking ratio. MAE reconstructs blurry images with low quality, while MAGE can reconstruct high-quality images with detail, and further improves quality through iterative decoding (see the evaluation section for details). With the same mask, MAGE generates diverse reconstruction results with different random seeds. Note that the mask for MAGE is on semantic tokens whereas that of MAE is on patches in the input image.
  • Figure 3: MAGE Framework: we first use a VQGAN tokenizer to tokenize the input image into a sequence of semantic tokens. We then sample a masking ratio (see text for details on the sampling strategy) and randomly mask out tokens according to this sampled ratio. A ViT encoder-decoder structure processes the unmasked tokens. A reconstructive cross-entropy loss encourages the model to reconstruct masked tokens. We can also add an optional contrastive loss at the output of the encoder to further improve the linear separability of the learned latent feature space.
  • Figure 4: Images generated by MAGE (ViT-L). (a) Images generated by MAGE trained with the default strong augmentation, which crops out a larger portion of the image. (b) Images generated by MAGE trained with weak augmentation, which crops out a smaller portion of the image. Visual fidelity and diversity are high for both models.
  • Figure 5: Transfer learning performance of ViT-B and ViT-L pre-trained on ImageNet-1K using different methods. Our methods outperform SimCLR and MAE on 6 of the 8 datasets.
  • ...and 8 more figures
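The reconstructive cross-entropy loss described in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: it assumes the loss is averaged over masked positions only, and the function name `masked_cross_entropy` and the plain-list data layout are hypothetical choices made to keep the example dependency-free.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over the positions where mask is True.

    logits:  one list of scores over the token codebook per position
    targets: ground-truth token id per position
    mask:    True where the token was masked out (assumed to be the only
             positions that contribute to the loss)
    """
    total, count = 0.0, 0
    for scores, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    if count == 0:
        raise ValueError("at least one position must be masked")
    return total / count


# Toy example: 3 positions, codebook of size 3, last position visible.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [0, 1, 1]
mask = [True, True, False]
loss = masked_cross_entropy(logits, targets, mask)
print(f"loss={loss:.4f}")
```

Visible positions are skipped because the model already receives those tokens as input, so predicting them carries no training signal in this sketch.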