HSViT: Horizontally Scalable Vision Transformer
Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, Douglas Creighton
TL;DR
HSViT addresses the limits of Vision Transformers that rely on large-scale pre-training by preserving inductive bias through an image-level feature embedding and enabling horizontal scalability with a distributed self-attention design. The method allows modules to run on multiple GPUs with minimal inter-device communication, aggregating predictions through CLS token voting. Empirically, HSViT improves top-1 accuracy by up to 10% on small datasets without pre-training and by up to 3.1% when integrated with CNN backbones on ImageNet-1k, broadening ViT applicability to resource-constrained settings. This approach opens pathways for deploying ViTs in distributed and cloud environments while leveraging existing CNN backbones for enhanced performance.
Abstract
Because it lacks prior knowledge (inductive bias), the Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing number of layers and parameters in ViT models impedes their applicability to devices with limited computing resources. To mitigate these challenges, this paper introduces a novel horizontally scalable vision transformer (HSViT) scheme. Specifically, a novel image-level feature embedding is introduced to ViT, where the preserved inductive bias removes the need for pre-training while allowing the model to outperform existing schemes on small datasets. In addition, a novel horizontally scalable architecture is designed, facilitating collaborative model training and inference across multiple computing devices. The experimental results show that, without pre-training, HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets, while providing existing CNN backbones up to 3.1% improvement in top-1 accuracy on ImageNet. The code is available at https://github.com/xuchenhao001/HSViT.
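
The TL;DR mentions that predictions from independently running modules are aggregated through CLS token voting. The following is a minimal sketch of what such an aggregation step could look like, not the authors' implementation. It assumes each module has already turned its CLS token into a list of class logits, and it shows two common voting rules (averaging probabilities and majority vote). The function names `soft_vote` and `hard_vote` are hypothetical, and the text here does not state which rule HSViT uses.

```python
from collections import Counter
from math import exp


def softmax(logits):
    """Convert a list of raw class scores into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_module_logits):
    """Average class probabilities across modules; return the winning class index.

    Assumption: every module outputs logits over the same set of classes.
    """
    probs = [softmax(logits) for logits in per_module_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


def hard_vote(per_module_logits):
    """Each module casts one vote for its top class; return the most voted class.

    Ties are resolved in favour of the class that received a vote first.
    """
    votes = [max(range(len(l)), key=l.__getitem__) for l in per_module_logits]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical logits from three modules over three classes.
    module_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [1.8, 0.4, 0.2]]
    print("soft vote:", soft_vote(module_logits))
    print("hard vote:", hard_vote(module_logits))
```

In this sketch only the per-module logits need to be gathered on one device, which is in keeping with the minimal inter-device communication described above. The two rules can disagree when one module is far more confident than the others, so the choice between them is a design decision.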
