HSViT: Horizontally Scalable Vision Transformer

Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, Douglas Creighton

TL;DR

HSViT addresses the limits of Vision Transformers that rely on large-scale pre-training by preserving inductive bias through an image-level feature embedding and enabling horizontal scalability with a distributed self-attention design. The method allows modules to run on multiple GPUs with minimal inter-device communication, aggregating predictions through CLS token voting. Empirically, HSViT delivers up to 10% higher top-1 accuracy on small datasets without pre-training and up to 3.1% gains when integrated with CNN backbones on ImageNet-1k, broadening ViT applicability to resource-constrained settings. This approach opens pathways for deploying ViTs in distributed and cloud environments while leveraging existing CNN backbones for enhanced performance.
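To make the idea of aggregating predictions by CLS token voting more concrete, here is a minimal sketch in plain Python. It assumes each independently running attention module has already turned its CLS token into a vector of class logits; the function names `soft_vote` and `hard_vote`, the example logits, and the choice between averaging probabilities and majority voting are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(group_logits):
    """Average per-module class probabilities (assumed aggregation rule).

    group_logits: one list of class logits per attention module.
    Returns (winning class index, averaged probabilities).
    """
    probs = [softmax(logits) for logits in group_logits]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    winner = max(range(n_classes), key=avg.__getitem__)
    return winner, avg


def hard_vote(group_logits):
    """Majority vote over each module's top class; ties go to the lowest index."""
    picks = [max(range(len(l)), key=l.__getitem__) for l in group_logits]
    counts = Counter(picks)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)


# Hypothetical logits from three modules, e.g. one per device.
example = [
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 0.3],
    [1.8, 0.4, 0.2],
]
winner, avg_probs = soft_vote(example)
majority = hard_vote(example)
```

In this sketch only the small per-module logit vectors would need to cross device boundaries, which is one way to read the "minimal inter-device communication" claim; the actual exchange pattern in HSViT may differ.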

Abstract

Because it lacks prior knowledge (inductive bias), the Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing number of layers and parameters in ViT models impedes their use on devices with limited computing resources. To mitigate these challenges, this paper introduces a novel horizontally scalable vision transformer (HSViT) scheme. Specifically, a novel image-level feature embedding is introduced to ViT, where the preserved inductive bias removes the need for pre-training while allowing the model to outperform existing schemes on small datasets. In addition, a novel horizontally scalable architecture is designed, facilitating collaborative model training and inference across multiple computing devices. The experimental results show that, without pre-training, HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets, while providing existing CNN backbones with up to 3.1% improvement in top-1 accuracy on ImageNet. The code is available at https://github.com/xuchenhao001/HSViT.

Paper Structure

This paper contains 12 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison of various hybrid ViT architectures. HSViT can be deployed across multiple computing devices to utilize resources better.
  • Figure 2: Number of parameters vs. Tiny-ImageNet top-1 accuracy (%). HSViT achieves higher top-1 accuracy when solely trained and tested on Tiny-ImageNet (a small dataset) due to its better preservation of inductive bias.
  • Figure 3: Feature processing pipeline of HSViT.
  • Figure 4: GPU utilization comparison between model parallelism (MP), pipeline parallelism (PP), and HSViT.
  • Figure 5: Number of convolutional kernels vs. number of attention groups. As the number of convolutional kernels and attention groups grows, the top-1 accuracy on CIFAR-10 rises.
  • ...and 3 more figures