Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images

Jen Hong Tan

TL;DR

This work investigates whether lightweight Vision Transformers (ViTs) can outperform CNNs on small datasets with low-resolution images. The author pre-trains a compact ViT with a Masked Auto-Encoder (MAE) while keeping images near their original scale (36x36) and the parameter count low, and reports state-of-the-art results on CIFAR-10 and CIFAR-100 among similarly sized transformers. The approach hinges on a patch-based MAE with a masking ratio of 0.75, separate learnable positional embeddings for the encoder and the decoder, and a discriminative reconstruction loss that includes both masked patches and discounted unmasked patches. The findings indicate that MAE pre-training is highly sample-efficient on small datasets and that pre-training on the same dataset used for fine-tuning yields the best fine-tuning performance, with practical implications for deploying lightweight ViTs in data-constrained scenarios.
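The masking and loss described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the patch size (3x3, giving 144 patches for a 36x36 image), the discount value `unmasked_weight=0.1`, the use of mean squared error, and the function names are all assumptions made for illustration. Only the 0.75 masking ratio and the idea of adding a discounted term for unmasked patches come from the summary.

```python
import numpy as np


def random_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return a boolean array in which True marks a masked patch."""
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    # Pick num_masked distinct patch indices uniformly at random.
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def reconstruction_loss(pred, target, mask, unmasked_weight=0.1):
    """MSE on masked patches plus a down-weighted MSE on visible patches.

    pred, target: arrays of shape (num_patches, patch_dim).
    mask: boolean array of shape (num_patches,), True = masked.
    unmasked_weight: hypothetical discount factor, not taken from the paper.
    Assumes at least one masked and one visible patch.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    masked_term = per_patch[mask].mean()
    unmasked_term = per_patch[~mask].mean()
    return masked_term + unmasked_weight * unmasked_term


rng = np.random.default_rng(0)
# 36x36 RGB image with assumed 3x3 patches: 12 * 12 = 144 patches of 27 values.
num_patches, patch_dim = 144, 3 * 3 * 3
mask = random_mask(num_patches, mask_ratio=0.75, rng=rng)
target = rng.normal(size=(num_patches, patch_dim))
pred = target + 0.1 * rng.normal(size=(num_patches, patch_dim))
loss = reconstruction_loss(pred, target, mask)
```

With these assumed sizes, 108 of the 144 patches are masked and 36 stay visible to the encoder. Setting `unmasked_weight=0` recovers a loss computed on masked patches only.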

Abstract

Can a lightweight Vision Transformer (ViT) match or exceed the performance of Convolutional Neural Networks (CNNs) like ResNet on small datasets with small image resolutions? This report demonstrates that a pure ViT can indeed achieve superior performance through pre-training, using a masked auto-encoder technique with minimal image scaling. Our experiments on the CIFAR-10 and CIFAR-100 datasets involved ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G, qualifying them as 'lightweight' models. Unlike previous approaches, our method attains state-of-the-art performance among similar lightweight transformer-based architectures without significantly scaling up images from CIFAR-10 and CIFAR-100. This achievement underscores the efficiency of our model, not only in handling small datasets but also in effectively processing images close to their original scale.

Paper Structure (14 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Modified MAE: implementation using separate learnable positional embeddings for the Encoder and the Decoder. The Encoder and the Decoder in this figure each consist only of transformer layers, a LayerNorm layer, and a linear projection layer.
  • Figure 2: The architecture of our MAE. 'B' stands for batch size, 'N' for number of embeddings.
  • Figure 3: Pre-training loss for both the CIFAR-10 and CIFAR-100 datasets. For CIFAR-10, the training loss begins at 0.63 and concludes at 0.0207, while for CIFAR-100 it starts at 0.66 and finishes at 0.0197. The y-axis of the plots is limited to a range of roughly 0.02 to 0.04 to make the loss trend visible over most of the epochs. As a result, the higher training losses observed in the early epochs are not displayed in this figure.
  • Figure 4: Example results on CIFAR-10 and CIFAR-100 validation images. In each triplet, the original image is on the left, the masked image is in the middle, and the image reconstructed by the MAE is on the right.
  • Figure 5: The evolution of the reconstructed outputs by the Masked Auto-Encoder at different training epochs: 20, 1020, 2020, and 3020. The left column displays a sample image from the CIFAR-10 training set, while the right column shows a corresponding sample from the CIFAR-100 training set. Each row corresponds to the reconstruction quality at the specified epoch, demonstrating the progressive refinement of the model's output over time.
  • ...and 1 more figures