MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets
Bowei Zhang, Yi Zhang
TL;DR
The paper addresses the challenge of applying Vision Transformers to tiny datasets by introducing MSCViT, a hybrid architecture that integrates Local Feature Extraction (LFE), Lightweight Multi-Scale Self-Attention (LMSSA), and Convolutional Feature Fusion (CFF) to inject locality and inductive bias. It demonstrates that replacing fixed positional encoding with LFE, together with multi-scale attention and selective convolutional fusion, yields competitive results without pretraining: 84.68% top-1 accuracy on CIFAR-100 and 72.11% on Tiny ImageNet. Ablation studies confirm the individual and combined contributions of LFE, LMSSA, and CFF, and show a favorable trade-off between accuracy and computational cost. The work provides a practical path toward mobile-friendly Transformers for small-data regimes and offers concrete design guidelines for balancing locality and global modeling.
Abstract
Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability to model long-range dependencies. However, this success is largely fueled by training on massive numbers of samples. In real applications, large-scale datasets are not always available, and ViT performs worse than Convolutional Neural Networks (CNNs) when it is trained only on a small-scale dataset (referred to here as a tiny dataset), since it requires a large amount of training data to realize its representational capacity. In this paper, a small-size ViT architecture with a multi-scale self-attention mechanism and convolution blocks (dubbed MSCViT) is presented to model different scales of attention at each layer. First, we introduce wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with our convolution channel to extract local features. Then, a lightweight multi-head attention module is developed to reduce the number of tokens and the computational cost. Finally, the positional encoding (PE) in the backbone is replaced by a local feature extraction module. Compared with the original ViT, MSCViT is parameter-efficient and particularly suitable for tiny datasets. Extensive experiments on tiny datasets show that our model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.
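The abstract does not spell out how the lightweight attention module reduces the number of tokens. As a rough illustration only, the sketch below assumes one common approach: keys and values are average-pooled over non-overlapping windows, so each query attends to fewer tokens. The function names, the pooling choice, the reduction ratio `r`, and the absence of learned projections are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_tokens(tokens, grid, r):
    # Hypothetical helper: average-pool a (grid*grid, dim) token map,
    # stored in row-major order, over non-overlapping r x r windows.
    dim = tokens.shape[1]
    g = grid // r
    x = tokens.reshape(g, r, g, r, dim)
    return x.mean(axis=(1, 3)).reshape(g * g, dim)

def reduced_attention(tokens, grid, r):
    # Single-head attention with pooled keys/values (assumed design).
    # Learned Q/K/V projections are omitted to keep the sketch short.
    dim = tokens.shape[1]
    q = tokens                           # (N, dim) queries, N = grid * grid
    kv = pool_tokens(tokens, grid, r)    # (N / r^2, dim) keys and values
    scores = q @ kv.T / np.sqrt(dim)     # (N, N / r^2) instead of (N, N)
    weights = softmax(scores, axis=-1)
    return weights @ kv, weights

rng = np.random.default_rng(0)
grid, dim, r = 8, 32, 2
tokens = rng.standard_normal((grid * grid, dim))
out, weights = reduced_attention(tokens, grid, r)
print(out.shape, weights.shape)
```

With these settings the score matrix has 64 x 16 entries rather than 64 x 64, so the attention cost drops by a factor of r^2 = 4 while every query still sees a summary of the whole image.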
