Powerful Design of Small Vision Transformer on CIFAR10
Gent Wu
TL;DR
This work studies efficient Tiny Vision Transformers (ViTs) for CIFAR-10, addressing the performance gap on small datasets by investigating data augmentation, patch token initialization, low-rank compression of attention via Multi-Head Latent Attention (MLA), and multi-class token strategies. It shows that low-rank compression of queries incurs minimal accuracy loss and that adding multiple CLS tokens increases global representation capacity and improves accuracy. The paper provides a practical design framework, including ablations on augmentation, initialization, and optimizers, and reports that careful choices (e.g., learnable positional embeddings, the Lion optimizer) yield competitive results at reduced computational cost. The findings offer actionable guidance for building efficient Tiny ViTs on small datasets; code is available at the repository linked in the abstract.
Abstract
Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark. We systematically evaluate the impact of data augmentation, patch token initialization, low-rank compression, and multi-class token strategies on model performance. Our experiments reveal that low-rank compression of queries in Multi-Head Latent Attention (MLA) incurs minimal performance loss, indicating redundancy in ViTs. Additionally, introducing multiple CLS tokens improves global representation capacity, boosting accuracy. These findings provide a comprehensive framework for optimizing Tiny ViTs, offering practical insights for efficient and effective designs. Code is available at https://github.com/erow/PoorViTs.
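To make the two architectural ideas above concrete, the following is a minimal back-of-the-envelope sketch of their cost, not the paper's implementation. It assumes that low-rank query compression replaces one dense query projection with a down-projection followed by an up-projection, and that extra CLS tokens are simply prepended to the patch sequence. The embedding width (192), rank (48), and patch size (4) are illustrative values chosen here, and the function names are hypothetical; only the 32x32 CIFAR-10 image size is a fixed property of the dataset.

```python
from typing import Optional

# Illustrative sizes only: the paper's actual width, rank, and patch size
# are not stated in the abstract, so these are assumptions.


def query_projection_params(dim: int, rank: Optional[int] = None) -> int:
    """Count weights in the query projection (biases ignored).

    rank=None: one dense dim x dim matrix.
    rank=r:    a down-projection (dim x r) followed by an
               up-projection (r x dim), i.e. a rank-r factorization.
    """
    if rank is None:
        return dim * dim
    return dim * rank + rank * dim


def sequence_length(image_size: int, patch_size: int, num_cls_tokens: int) -> int:
    """Tokens entering the encoder: non-overlapping patches plus CLS tokens."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + num_cls_tokens


dim, rank = 192, 48  # assumed embedding width and latent rank
full = query_projection_params(dim)
low = query_projection_params(dim, rank)
print(full, low, low / full)  # 36864 18432 0.5

# CIFAR-10 images are 32x32; a patch size of 4 is an assumption.
print(sequence_length(32, 4, num_cls_tokens=1))  # 65
print(sequence_length(32, 4, num_cls_tokens=4))  # 68
```

Under these assumed sizes, the factorized query projection halves the weight count, while going from one CLS token to four lengthens the sequence by only three tokens, which suggests why both changes are cheap to try on a small model.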
