DailyMAE: Towards Pretraining Masked Autoencoders in One Day

Jiantao Wu; Shentong Mo; Sara Atito; Zhenhua Feng; Josef Kittler; Muhammad Awais

TL;DR

This paper tackles the high computational cost of pretraining masked image modeling (MIM) for self-supervised learning by introducing efficient recipes that mitigate data-loading bottlenecks and apply progressive training. It presents an enhanced FFCV-based data pipeline (ESSL) and progressive resolution strategies to accelerate MAE pretraining, training MAE-Base/16 on ImageNet-1K in about 18 hours on a single multi-GPU machine, with speedups of up to 5.8×. The authors also propose a comprehensive finetuning recipe with Three Augmentations and standardized validation, and they systematically study compression, data shifts, and dynamic resizing during both finetuning and pretraining. The resulting framework lowers the barrier to SSL research and rapid prototyping, while highlighting trade-offs in data compression and resolution that affect accuracy and efficiency. Overall, the work provides a practical toolkit for fast, iterative SSL experimentation on limited hardware and makes MAE-style pretraining research more accessible.

Abstract

Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representations from unlabeled data. Numerous studies underscore the advantages of MIM, highlighting how models pretrained on extensive datasets can enhance performance on downstream tasks. However, the high computational demands of pretraining pose significant challenges, particularly within academic environments, thereby impeding progress in SSL research. In this study, we propose efficient training recipes for MIM-based SSL that focus on mitigating data loading bottlenecks and on employing progressive training techniques and other tricks to closely maintain pretraining performance. Our library enables the training of a MAE-Base/16 model on the ImageNet-1K dataset for 800 epochs within just 18 hours, using a single machine equipped with 8 A100 GPUs. By achieving speed gains of up to 5.8 times, this work not only demonstrates the feasibility of conducting high-efficiency SSL training but also paves the way for broader accessibility and promotes advancement in SSL research, particularly for prototyping and initial testing of SSL ideas. The code is available at https://github.com/erow/FastSSL.
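
To put the headline figure in perspective, the following back-of-the-envelope sketch computes the average throughput implied by 800 epochs in 18 hours on 8 GPUs. The epoch size of 1,281,167 images is taken from the Figure 1 caption; the function name and the assumption that the full 18 hours is spent in the training loop (no evaluation or checkpointing overhead) are illustrative simplifications, not details from the paper.

```python
# Rough throughput implied by the reported training budget.
# Assumption: the whole wall-clock budget is spent iterating over training images.
IMAGES_PER_EPOCH = 1_281_167  # ImageNet-1K training set size (Figure 1 caption)
EPOCHS = 800                  # reported pretraining length
WALL_CLOCK_HOURS = 18         # reported upper bound on training time
NUM_GPUS = 8                  # reported number of A100 GPUs


def required_throughput(images_per_epoch: int, epochs: int, hours: float) -> float:
    """Average images per second needed to finish `epochs` passes in `hours`."""
    return images_per_epoch * epochs / (hours * 3600)


total = required_throughput(IMAGES_PER_EPOCH, EPOCHS, WALL_CLOCK_HOURS)
per_gpu = total / NUM_GPUS

print(f"aggregate: {total:,.0f} images/s")  # aggregate: 15,817 images/s
print(f"per GPU:   {per_gpu:,.0f} images/s")  # per GPU:   1,977 images/s
```

Because the abstract states that training finishes "within" 18 hours, these values are best read as lower bounds on the sustained average: the data loader has to deliver at least this many images per second, across all resolutions used during progressive training, for the GPUs not to stall.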

Paper Structure (30 sections, 3 equations, 6 figures, 13 tables)

Introduction
Related Work
Data Loading Library.
Masked Image Modeling.
Progressive Training.
Efficient Masked Autoencoder
Finetuning Recipe.
Pretraining Recipe.
Machine Specification
Removing the Data Loading Bottleneck.
Crop Decode
Compression Parameters for Building an FFCV Dataset
Compression Shift
Three Augmentation for Compression Shift Mitigation.
Discussion
...and 15 more sections

Figures (6)

Figure 1: Time consumption for loading and training one epoch (1,281,167 images). The runtime of pretraining MAE-B/16 is measured without data loading.
Figure 2: Online probing accuracy versus training time (x-axis) when pretraining MAE-Base/16 from scratch on a single machine with 8 A100s. Each point denotes the total runtime and final accuracy. "ESSL_s4" denotes our improved FFCV with dynamic resolution.
Figure 3: Comparison of Throughput between FFCV and ESSL. '-' denotes no resizing when building the dataset. Maximum image size refers to the largest side (width or height) an image can have after being resized.
Figure 4: Example of perceptual ratio and apparent size. The perceptual ratio denotes the seen percentage of objects. The apparent size denotes the pixel size for objects.
Figure 5: Training benchmark of throughput (images/s) on two platforms, including data loading, forward, backward, and optimization. GPU 1 and 8 denote the pure training processes without data loading.
...and 1 more figure
