HydraViT: Stacking Heads for a Scalable ViT

Janek Haberer, Ali Hojjat, Olaf Landsiedel

TL;DR

HydraViT tackles the challenge of deploying Vision Transformers on devices with diverse resource profiles by introducing a universal ViT that intrinsically contains multiple subnetworks. It trains a single model with $H$ attention heads and embedding dimension $E$ using a stochastic scheme that repeatedly activates the first $k$ heads and corresponding embedding slices, enabling runtime selection of the subnetwork based on available hardware. The approach is augmented with optional separate classifiers and a weighted subnetwork sampling strategy to improve per-subnetwork accuracy. Experimental results on ImageNet-1K show HydraViT can match or exceed the accuracy of independently trained DeiT tiny/small/base models at the same GMACs or throughput, and can support up to 10 subnetworks, with robustness across several ImageNet variants. While HydraViT increases training complexity due to multi-subnetwork optimization, it reduces total training time relative to training multiple distinct models and offers a practical pathway for scalable, device-aware ViT deployment.

Abstract

The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, requires multiple models of different sizes. However, this approach has limitations, such as training and storing each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly changing the size of the embedded dimensions throughout each layer and their corresponding number of attention heads in MHA during training, HydraViT induces multiple subnetworks. Thereby, HydraViT achieves adaptability across a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. HydraViT achieves up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy with the same throughput on ImageNet-1K compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time. Source code available at https://github.com/ds-kiel/HydraViT.
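The stochastic training scheme summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `subnetwork_width`, `sample_num_heads`, and `training_schedule` are hypothetical, the per-head width of 64 is an assumption taken from the standard DeiT configurations (192, 384, and 768 dimensions for 3, 6, and 12 heads), and the sampling weights are placeholders rather than values from the paper.

```python
import random

# Assumption: every head contributes 64 embedding dimensions, as in DeiT
# (192 / 3 = 384 / 6 = 768 / 12 = 64). The source does not state this value.
HEAD_DIM = 64


def subnetwork_width(num_heads, head_dim=HEAD_DIM):
    """Embedding width used when only the first `num_heads` heads are active."""
    return num_heads * head_dim


def sample_num_heads(rng, head_choices, weights=None):
    """Draw the head count for one training step.

    With weights=None the draw is uniform; passing weights gives a
    weighted subnetwork sampling (the weights here are illustrative only).
    """
    return rng.choices(head_choices, weights=weights, k=1)[0]


def training_schedule(steps, head_choices, weights=None, seed=0):
    """Return (num_heads, embedding_width) for each training step.

    In a real training loop, each step would run the forward and backward
    pass using only the first `num_heads` heads and the first
    `embedding_width` embedding dimensions of the shared model.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(steps):
        k = sample_num_heads(rng, head_choices, weights)
        schedule.append((k, subnetwork_width(k)))
    return schedule
```

Calling `training_schedule(5, [3, 6, 12], weights=[1, 1, 2])` yields five `(heads, width)` pairs; because every subnetwork uses a prefix of the same heads, smaller subnetworks share their weights with larger ones.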

Paper Structure (25 sections, 4 equations, 12 figures, 7 tables, 1 algorithm)

Figures (12)

  • Figure 1: Performance comparison of HydraViT and baselines on ImageNet-1K in terms of GMACs (a) and throughput (b), evaluated on an NVIDIA A100 80GB PCIe. HydraViT trained on 3-12 heads outperforms DynaBERT (Hou et al., 2020) and SortedNet (Valipour et al., 2023). MatFormer (Kudugunta et al., 2023) shows higher performance than HydraViT within its limited scalability range, but when trained on a narrower scalability range (9-12 heads), HydraViT surpasses MatFormer. Training HydraViT for more epochs can further improve accuracy. Note that each line corresponds to one model, and that changing the number of heads in the vanilla DeiT models drops their accuracy to below 30%.
  • Figure 2: Architecture of HydraViT
  • Figure 3: An example of how we extract a subnetwork with 4 heads in MHA with a total of 6 heads. In HydraViT, stochastic dropout training orders the attention heads in MHA, and consequently their corresponding embedding vectors, by importance.
  • Figure 4: An illustration of subnetwork extraction within MLP and NORM layers, introduced in HydraViT. The first panel demonstrates how HydraViT slices activations, denoted as $A_{1}$ and $A_{2}$, along with their respective weight matrices, denoted as $W_{1}$ and $W_{2}$, based on the number of utilized heads. The second panel shows how HydraViT applies normalization to the activations corresponding to the used heads. For simplicity, only subnetworks with 3, 6, and 12 heads, corresponding to ViT-Ti, ViT-S, and ViT-B respectively, are presented.
  • Figure 5: Stochastic dropout training
  • ...and 7 more figures
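
The slicing described in the Figure 4 caption can be illustrated with a short sketch. It is written under stated assumptions and is not taken from the HydraViT source code: `sliced_linear` and `sliced_layer_norm` are hypothetical helpers, weights are stored as rows of output features (the PyTorch `nn.Linear` layout), and plain Python lists stand in for tensors.

```python
import math


def sliced_linear(x, weight, bias, in_dim, out_dim):
    """Apply only the top-left (out_dim x in_dim) block of a linear layer.

    Assumption: weight[o][i] maps input feature i to output feature o.
    Features beyond in_dim / out_dim belong to inactive heads and are ignored.
    """
    return [
        sum(weight[o][i] * x[i] for i in range(in_dim)) + bias[o]
        for o in range(out_dim)
    ]


def sliced_layer_norm(x, gamma, beta, dim, eps=1e-6):
    """Layer-normalize only the first `dim` features of x.

    Mean and variance are computed over the active slice, and only the
    matching slices of the scale (gamma) and shift (beta) are used.
    """
    active = x[:dim]
    mean = sum(active) / dim
    var = sum((v - mean) ** 2 for v in active) / dim
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * gamma[i] + beta[i] for i, v in enumerate(active)]
```

Both helpers read a prefix of the full parameters, so a subnetwork with fewer heads needs no separate copy of the weights.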