Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images
Jen Hong Tan
TL;DR
This work investigates whether lightweight Vision Transformers (ViTs) can outperform CNNs on small datasets with low-resolution images. By pre-training a compact ViT with a Masked Auto-Encoder (MAE) while keeping images near their original scale (36x36) and the parameter count small, the author demonstrates state-of-the-art results on CIFAR-10/100 relative to similarly sized transformers. The approach hinges on a patch-based MAE with a masking ratio of 0.75, separate learnable positional embeddings for the encoder and decoder, and a discriminative reconstruction loss that includes both the masked patches and, at a discount, the unmasked patches. The findings indicate that MAE pre-training is highly sample-efficient on small datasets and that pre-training on the same dataset used for fine-tuning yields the best fine-tuning performance, with practical implications for deploying lightweight ViTs in data-constrained scenarios.
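The masking and loss described above can be illustrated with a minimal NumPy sketch. It is not the author's implementation. The function names, the patch size of 3 (which gives 144 patches for a 36x36 image), the discount weight of 0.1, and the use of plain mean squared error are all assumptions chosen for illustration. Only the 0.75 masking ratio and the idea of discounting the unmasked patches come from the summary.

```python
import numpy as np


def random_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return a boolean array in which True marks a masked (hidden) patch."""
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    # Pick num_masked distinct patch indices uniformly at random.
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def reconstruction_loss(pred, target, mask, unmasked_weight=0.1):
    """MSE on masked patches plus a discounted MSE on unmasked patches.

    pred, target: arrays of shape (num_patches, patch_dim).
    mask: boolean array of shape (num_patches,), True = masked.
    unmasked_weight: hypothetical discount factor (an assumption).
    """
    per_patch = ((pred - target) ** 2).mean(axis=1)
    masked_term = per_patch[mask].mean()
    unmasked_term = per_patch[~mask].mean()
    return masked_term + unmasked_weight * unmasked_term


# Assumed setup: 36x36 RGB image, 3x3 patches -> 12 * 12 = 144 patches,
# each flattened to 3 * 3 * 3 = 27 values.
num_patches, patch_dim = 144, 27
mask = random_mask(num_patches, 0.75, np.random.default_rng(0))
target = np.zeros((num_patches, patch_dim))
pred = np.ones((num_patches, patch_dim))
loss = reconstruction_loss(pred, target, mask, unmasked_weight=0.1)
```

In this sketch only the visible quarter of the patches would be passed to the encoder. Setting `unmasked_weight` to 0 gives a loss over masked patches only.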
Abstract
Can a lightweight Vision Transformer (ViT) match or exceed the performance of Convolutional Neural Networks (CNNs) such as ResNet on small datasets with low image resolutions? This report demonstrates that a pure ViT can indeed achieve superior performance through pre-training, using a masked auto-encoder technique with minimal image scaling. Our experiments on the CIFAR-10 and CIFAR-100 datasets involved ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G, qualifying them as 'lightweight' models. Unlike previous approaches, our method attains state-of-the-art performance among similar lightweight transformer-based architectures without significantly scaling up images from CIFAR-10 and CIFAR-100. This result underscores the efficiency of our model, both in handling small datasets and in processing images close to their original scale.
