MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets

Bowei Zhang; Yi Zhang

TL;DR

The paper addresses the challenge of applying Vision Transformers to tiny datasets by introducing MSCViT, a hybrid architecture that integrates Local Feature Extraction, Lightweight Multi-scale Self-Attention, and Convolutional Feature Fusion to inject locality and inductive bias. It demonstrates that replacing fixed positional encoding with LFE, along with multi-scale attention and selective convolutional fusion, yields competitive results without pretraining, achieving 84.68% top-1 on CIFAR-100 and 72.11% on Tiny ImageNet. Ablation studies confirm the individual and synergistic contributions of LFE, LMSSA, and CFF, and show favorable efficiency trade-offs. The work provides a practical path toward mobile-friendly Transformers for small data regimes and offers concrete design guidelines for balancing locality and global modeling.

Abstract

Vision Transformer (ViT) has demonstrated significant potential in various vision tasks thanks to its strong ability to model long-range dependencies. However, this success is largely fueled by training on massive numbers of samples. In real applications, large-scale datasets are not always available, and ViT performs worse than Convolutional Neural Networks (CNNs) when it is trained only on a small-scale dataset (called a tiny dataset), since it requires a large amount of training data to realize its representational capacity. In this paper, we present a small-size ViT architecture with a multi-scale self-attention mechanism and convolution blocks (dubbed MSCViT) that models different scales of attention at each layer. First, we introduce wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with our convolution channel to extract local features. Then, we develop a lightweight multi-head attention module to reduce the number of tokens and the computational cost. Finally, the positional encoding (PE) in the backbone is replaced by a local feature extraction module. Compared with the original ViT, MSCViT is parameter-efficient and is particularly suitable for tiny datasets. In extensive experiments on tiny datasets, our model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.
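
The abstract says the lightweight attention module reduces the number of tokens, but this page does not say how. As a rough illustration only, the sketch below assumes one common approach: average-pooling the keys and values over non-overlapping windows of the token grid before attention. The function name `pooled_attention`, the `pool` parameter, and the use of identity projections in place of learned query/key/value weights are all hypothetical simplifications, not the authors' LMSSA module.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pooled_attention(tokens, grid, pool):
    """Single-head self-attention with average-pooled keys and values.

    Hypothetical sketch, not the paper's module.
    tokens: (N, C) array of tokens laid out row-major on a grid x grid map.
    grid:   side length of the square token grid (N == grid * grid).
    pool:   side of the pooling window; must divide `grid`.
    Returns the attended tokens (N, C) and the weights (N, N / pool**2).
    """
    n, c = tokens.shape
    assert n == grid * grid, "tokens must form a square grid"
    assert grid % pool == 0, "pool must divide grid"

    g = grid // pool
    # Split rows and columns into (window index, offset inside window),
    # then average over the two offset axes to pool each window.
    pooled = tokens.reshape(g, pool, g, pool, c).mean(axis=(1, 3))
    kv = pooled.reshape(g * g, c)

    # Assumption: identity projections stand in for learned W_q, W_k, W_v.
    scores = tokens @ kv.T / np.sqrt(c)
    weights = softmax(scores, axis=-1)
    return weights @ kv, weights


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))          # 8 x 8 grid of 32-dim tokens
out, w = pooled_attention(x, grid=8, pool=2)
print(out.shape, w.shape)                  # (64, 32) (64, 16)
```

With `pool=2`, each of the 64 queries attends to 16 pooled tokens instead of 64, so the score matrix shrinks by a factor of `pool**2`. Running the same function with several `pool` values would give attention at several scales, which is one plausible reading of "multi-scale" here, though the page does not confirm it.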

Paper Structure (23 sections, 8 equations, 5 figures, 8 tables)

Introduction
Related works
Vision Transformer (ViT)
Introducing Convolutions to Transformer
ViT for tiny datasets
Method
Overall Architecture
Local Feature Extraction (LFE)
Lightweight Multi-scale Self-Attention (LMSSA)
Convolutional Feature Fusion (CFF)
Scaling Strategy
Experiment
Datasets
Experiment Settings
Results
...and 8 more sections

Figures (5)

Figure 1: Performance of MSCViT on CIFAR-10 and CIFAR-100. MSCViT performs better than some models with similar structures.
Figure 2: The overall architecture of the proposed MSCViT.
Figure 3: The comparison of the model sizes and accuracies among different methods.
Figure 4: The comparison of heatmaps of different methods generated by Grad-CAM.
Figure 5: Visual demonstration of the functions of the proposed LFE, CFF and LMSSA modules.
