ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders

Carlos Hinojosa; Shuming Liu; Bernard Ghanem

TL;DR

ColorMAE is a simple yet effective data-independent masking method that generates different binary mask patterns by filtering random noise, drawing inspiration from color noise in image processing. It requires no additional learnable parameters or computational overhead in the network, yet it significantly enhances the learned representations.

Abstract

Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework, offering remarkable performance across a wide range of downstream tasks. To increase the difficulty of the pretext task and learn richer visual representations, existing works have focused on replacing standard random masking with more sophisticated strategies, such as adversarial-guided and teacher-guided masking. However, these strategies depend on the input data, which commonly increases model complexity and requires additional calculations to generate the mask patterns. This raises the question: can we enhance MAE performance beyond random masking without relying on input data or incurring additional computational costs? In this work, we introduce a simple yet effective data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise. Drawing inspiration from color noise in image processing, we explore four types of filters to yield mask patterns with different spatial and semantic priors. ColorMAE requires no additional learnable parameters or computational overhead in the network, yet it significantly enhances the learned representations. We provide a comprehensive empirical evaluation, demonstrating our strategy's superiority in downstream tasks compared to random masking. Notably, we report an improvement of 2.72 in mIoU in semantic segmentation tasks relative to baseline MAE implementations.

Paper Structure (10 sections, 4 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Self-supervised Learning
Masking Strategy
Proposed Method
Experiments
Exploring Masking Strategies Performance
Comparison with Other Methods
Analysis
Conclusions

Figures (6)

Figure 1: We use MAE [he2022masked] to mask and reconstruct an input image using different masking strategies with a masking ratio of 75%. The first three columns of the top row show traditional data-independent strategies: random, block-wise, and grid-wise masking, respectively. The last column of the top row shows an example of adaptive masking using an attention-based mechanism. The second row shows the four distinct types of masks generated by ColorMAE when filtering random noise with high-pass (Blue), band-pass (Green), band-stop (Purple), and low-pass (Red) filters.
Figure 2: Starting with random noise, we apply four filters, high-pass (Blue), band-pass (Green), band-stop (Purple), and low-pass (Red), to produce the filtered noises. The periodogram of each filtered noise is displayed in the second row; as observed, each filtered noise exhibits a distinctive pattern in the frequency domain. Once the filtered noise is obtained, we perform a random crop on it to obtain a local window (sized to match the total number of patches) and select the top values according to the desired mask ratio (e.g., 75%) to create the binary mask used during pre-training.
Figure 3: Reconstruction results on ImageNet validation images from MAE pre-trained for 300 epochs with our four generated masks: Blue, Green, Purple, and Red.
Figure 4: MAE pre-training loss for different masking strategies with ViT-B.
Figure 5: Self-attention of the [CLS] tokens averaged across the heads of the last layer in MAE pre-trained using random masking and our proposed Green masking approach (ColorMAE-G). We show attention maps on images from the ImageNet-1K [russakovsky2015imagenet] (1st-3rd columns), Microsoft COCO [lin2014microsoft] (4th-6th columns), and ADE20K [zhou2017scene] (7th-9th columns) datasets. Both MAE and ColorMAE-G are pre-trained on ImageNet-1K for 300 epochs. Please refer to our supplementary material for more visualizations of the attention maps when pre-training MAE with other ColorMAE masks.
...and 1 more figure
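
The mask-generation procedure described in the Figure 2 caption (filter random noise, take a random crop the size of the patch grid, keep the top values according to the mask ratio) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the ideal radial FFT filters, the cutoff values `low` and `high`, the 14x14 patch grid, and the function names `filtered_noise` and `sample_mask` are all assumptions introduced here. Treating the top values as the masked patches is likewise an assumption about the caption's wording.

```python
import numpy as np


def filtered_noise(size, kind, low=0.1, high=0.3, rng=None):
    """Filter white Gaussian noise with an ideal radial filter in the FFT domain.

    `low` and `high` are hypothetical cutoffs in cycles per pixel.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((size, size))

    # Radial frequency of every FFT bin.
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    if kind == "low":        # "Red" in the caption's naming
        keep = radius <= low
    elif kind == "high":     # "Blue"
        keep = radius >= high
    elif kind == "band":     # "Green" (band-pass)
        keep = (radius >= low) & (radius <= high)
    elif kind == "stop":     # "Purple" (band-stop)
        keep = (radius < low) | (radius > high)
    else:
        raise ValueError(f"unknown filter kind: {kind!r}")

    # Zero out the rejected frequencies and return to the spatial domain.
    return np.fft.ifft2(np.fft.fft2(noise) * keep).real


def sample_mask(noise, grid=14, mask_ratio=0.75, rng=None):
    """Randomly crop a grid x grid window and mask its largest values."""
    rng = np.random.default_rng() if rng is None else rng
    top = rng.integers(0, noise.shape[0] - grid + 1)
    left = rng.integers(0, noise.shape[1] - grid + 1)
    window = noise[top:top + grid, left:left + grid].ravel()

    # Indices of the num_masked largest values in the window.
    num_masked = int(round(mask_ratio * window.size))
    masked_idx = np.argsort(window)[window.size - num_masked:]

    mask = np.zeros(window.size, dtype=bool)
    mask[masked_idx] = True  # True marks a masked patch
    return mask.reshape(grid, grid)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    green = filtered_noise(256, "band", rng=rng)
    mask = sample_mask(green, grid=14, mask_ratio=0.75, rng=rng)
    print(mask.shape, int(mask.sum()))
```

Because the filtered noise depends only on the random generator and not on any image, it could in principle be computed once and reused, with only the random crop and the top-value selection performed per sample.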
