CL-MAE: Curriculum-Learned Masked Autoencoders

Neelu Madan, Nicolae-Catalin Ristea, Kamal Nasrollahi, Thomas B. Moeslund, Radu Tudor Ionescu

TL;DR

The Curriculum-Learned Masked Autoencoder (CL-MAE), trained on ImageNet, learns better representations than MAE, showing that curriculum learning can be used successfully in the self-supervised training of masked autoencoders.

Abstract

Masked image modeling has been demonstrated as a powerful pretext task for generating robust representations that can be effectively generalized across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module that possesses the capability to generate masks of different complexities, and integrate the proposed module into masked autoencoders (MAE). Our module is jointly trained with the MAE, while adjusting its behavior during training, transitioning from a partner to the MAE (optimizing the same reconstruction loss) to an adversary (optimizing the opposite loss), while passing through a neutral state. The transition between these behaviors is smooth, being regulated by a factor that is multiplied with the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. The empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can be successfully used to self-supervise masked autoencoders. We release our code at https://github.com/ristea/cl-mae.
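
The abstract describes the curriculum factor only qualitatively: it multiplies the reconstruction loss of the masking module and moves smoothly from a partner regime, through a neutral state, to an adversarial regime. As a rough illustration, the sketch below assumes a linear schedule from +1 to -1. The linear shape, the endpoints, and the names `curriculum_factor` and `masking_module_loss` are assumptions made for this example and are not taken from the paper or its released code.

```python
def curriculum_factor(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule for the curriculum factor.

    Returns +1.0 at the first step (partner: same objective as the MAE),
    0.0 at the midpoint (neutral), and -1.0 at the last step (adversary:
    opposite objective). The linear shape is an assumption.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    progress = step / (total_steps - 1)  # 0.0 at the start, 1.0 at the end
    return 1.0 - 2.0 * progress


def masking_module_loss(reconstruction_loss: float, step: int, total_steps: int) -> float:
    """Reconstruction term seen by the masking module (illustrative only).

    A positive factor means the module is rewarded for masks that are easy
    to reconstruct; a negative factor rewards masks that are hard to
    reconstruct.
    """
    return curriculum_factor(step, total_steps) * reconstruction_loss


if __name__ == "__main__":
    total = 101
    for step in (0, 25, 50, 75, 100):
        print(step, curriculum_factor(step, total), masking_module_loss(2.0, step, total))
```

Under this assumed schedule, minimizing the scaled loss early in training pushes the masking module toward easy masks, while minimizing it late in training pushes it toward hard masks, which matches the easy-to-hard behavior the abstract describes.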

Paper Structure (14 sections, 7 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Our Curriculum-Learned Masked Autoencoder (CL-MAE) comprises a learnable masking module that decides what tokens need to be masked at each training iteration. The architecture of our module uses $N$ vision transformer (ViT) (Dosovitskiy et al., 2021) blocks based on multi-head attention (MHA), layer normalization (LN) and multi-layer perceptrons (MLPs). The final [CLS] token is passed through an MLP, a linear projection and a sigmoid activation ($\sigma$), producing token masking probabilities. The masking module uses an easy-to-hard curriculum learning schedule that transitions smoothly from optimizing the same reconstruction objective as the MAE to an adversarial (opposed) objective. Hence, our masking module generates more or less complex masks, depending on its current objective. Our curriculum masking module (CMM) and the MAE (He et al., 2022) are trained in alternating steps, similar to how generative adversarial networks (Goodfellow et al., 2014) are trained. During inference, the masking module is removed. Best viewed in color.
  • Figure 2: Masks generated by our masking module at two different moments during training, when all losses are in place, for the images on the top row. The masks on the second row are generated halfway through training, when the masking module is still acting as a partner to the MAE. In contrast, the masks on the bottom row are generated in the last epoch, when the masking module is behaving as an adversary to the MAE. Our module shifts its preference from masking non-salient tokens to masking tokens situated on edges and object contours, generating an easy-to-hard curriculum for the MAE.
  • Figure 3: Masks generated by the proposed masking module without (left) and with (right) adding the diversity loss ($\mathcal{L}_{div}$). If the diversity loss ($\mathcal{L}_{div}$) is not included, the masking module can enter mode collapse and produce nearly identical masks. This can lead to overfitting CL-MAE on reconstructing certain patch configurations. The effect is no longer observed when the proposed diversity loss is employed.
  • Figure 4: Masks generated by the proposed masking module during training, without the Kullback-Leibler loss. The masks evolve from leaving all patches visible (to reduce the reconstruction error for the MAE) to hiding all patches (to increase the reconstruction error for the MAE). The Kullback-Leibler loss is required to make sure the model always masks the desired number of patches.
  • Figure 5: Few-shot linear probing results for MAE (He et al., 2022) and CL-MAE (ours) based on various backbones (ViT-B, ViT-L, ViT-H). The number of training samples per class varies between $1$ and $16$. The reported accuracy rates are averaged over three runs. Best viewed in color.