Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization

Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen

TL;DR

This work tackles on-device fine-tuning of vision transformers under strict memory and compute limits. It introduces WASI, a framework that unifies Activation Subspace Iteration and Weight Subspace Iteration to train transformers in compact subspaces, reducing activation and weight memory while preserving accuracy. The reported gains include memory reductions of up to $62\times$ and roughly $1.5\times$ speedups on edge hardware such as the Raspberry Pi 5, and the method is demonstrated on ViT, SwinT, and TinyLlama. By exploiting stable subspace representations, WASI supports privacy-preserving, energy-efficient edge AI for transformer models, not only CNNs.

Abstract

As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model's essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.
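
The abstract's core idea is that training can be restricted to a low-dimensional subspace. As a rough illustration only, the NumPy sketch below applies generic orthogonal (subspace) iteration to a synthetic, nearly low-rank matrix and compares the storage of the full matrix with that of its rank-$r$ factors. The function name `subspace_iteration`, the synthetic matrix, and the sizes `m`, `n`, `r` are assumptions made for this example; this is not the authors' WASI algorithm, whose details are not given in this excerpt.

```python
import numpy as np

def subspace_iteration(W, V, n_steps=1):
    """Generic orthogonal iteration (hypothetical helper, not the paper's code).

    W: (m, n) matrix; V: (n, r) orthonormal starting basis.
    Returns orthonormal bases U (m, r) and V (n, r) for the dominant
    left and right subspaces of W.
    """
    for _ in range(n_steps):
        U, _ = np.linalg.qr(W @ V)    # re-orthonormalize the column-space basis
        V, _ = np.linalg.qr(W.T @ U)  # re-orthonormalize the row-space basis
    return U, V

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4  # assumed sizes, chosen only for illustration

# Synthetic stand-in for a weight matrix: rank-r signal plus small noise.
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
W += 1e-3 * rng.standard_normal((m, n))

V0, _ = np.linalg.qr(rng.standard_normal((n, r)))  # random orthonormal start
U, V = subspace_iteration(W, V0, n_steps=3)

# Rank-r approximation: project W onto the subspace spanned by U.
W_lowrank = U @ (U.T @ W)
rel_err = np.linalg.norm(W - W_lowrank) / np.linalg.norm(W)

full_params = m * n            # entries stored for the dense matrix
factored_params = r * (m + n)  # entries stored for U (m x r) and U^T W (r x n)

print(f"relative error: {rel_err:.2e}")
print(f"stored entries: full={full_params}, factored={factored_params}")
```

When the basis from the previous step is reused as the starting point, a few cheap QR-based refinements can stand in for a full SVD, which is the usual motivation for subspace iteration when the dominant subspace changes slowly.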

Paper Structure

This paper contains 24 sections, 33 equations, 12 figures, 2 tables, 2 algorithms.

Figures (12)

  • Figure 1: Overview of WASI in a single training iteration.
  • Figure 2: For the linear layer $i$ with a single data batch of size $B$, given varying dimensions of $\mathcal{W}_i$ and $\mathcal{A}_i$ and different values of $\mathbf{r}_{i,m}$, $C_\text{training}$ and $C_\text{inference}$ illustrate the evolution in compression rates for training and inference, respectively; while $S_\text{training}$ and $S_\text{inference}$ forecast the speedup ratios for these processes.
  • Figure 3: When fine-tuning ViT on the Pets dataset, (a) illustrates the evolution of singular values of $\mathcal{W}_6$ across epochs; (b) compares WSI and full SVD in terms of accuracy and training FLOPs under varying explained variance thresholds, $\varepsilon \in \{0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$.
  • Figure 4: Explained variance of each singular value of $\mathcal{A}_i$ across all of its modes when fine-tuning ViT on the Pets dataset.
  • Figure 5: Resource consumption during fine-tuning and inference of ViT on the CIFAR-10 dataset. Each marker in the plots corresponds to a different compression rate, with the red diamond indicating vanilla training.
  • ...and 7 more figures
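
The Figure 3 caption refers to explained variance thresholds $\varepsilon$. A minimal sketch of how such a threshold could be turned into a rank is shown below, assuming the common definition of explained variance as the cumulative share of squared singular values; the paper's exact definition is not stated in this excerpt and may differ. The helper name `rank_for_explained_variance` and the toy singular values are hypothetical.

```python
import numpy as np

def rank_for_explained_variance(singular_values, eps):
    """Smallest rank whose cumulative squared singular values reach eps.

    Assumes explained variance = sum of the top-k s_i^2 divided by the
    sum of all s_i^2 (an assumption, not taken from the paper).
    """
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First index where the cumulative share is >= eps, converted to a count.
    rank = int(np.searchsorted(cumulative, eps)) + 1
    return min(rank, len(energy))  # guard against round-off when eps is near 1

# Toy singular values, sorted in decreasing order.
s = [3.0, 2.0, 1.0, 0.5]
thresholds = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # the values listed in the caption
ranks = [rank_for_explained_variance(s, eps) for eps in thresholds]
print(dict(zip(thresholds, ranks)))
```

A lower threshold keeps fewer singular directions, giving stronger compression at the cost of a coarser approximation.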