Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization
Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen
TL;DR
This work tackles on-device fine-tuning of vision transformers under strict memory and compute limits. It introduces WASI, a framework that combines Activation Subspace Iteration and Weight Subspace Iteration to train transformers in compact subspaces, reducing activation and weight memory while preserving accuracy. Reported gains include memory reductions of up to 62x and roughly 1.5x faster training and inference on a Raspberry Pi 5, and the method is shown to generalize across ViT, SwinT, and TinyLlama. By exploiting stable subspace representations, WASI extends privacy-preserving, energy-efficient on-device learning from CNNs to transformer models.
Abstract
As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model's essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.
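To make the idea of restricting a matrix to a compact subspace more concrete, the following is a minimal sketch of generic subspace (orthogonal) iteration in NumPy. It is an illustration only and not necessarily the procedure used in WASI: the function names, the `n_iter` default, and the optional warm-start `basis` argument are assumptions introduced here. It shows how a matrix of shape $m \times n$ can be stored as rank-$r$ factors costing $r(m+n)$ values instead of $mn$.

```python
import numpy as np


def subspace_iteration(a, rank, n_iter=2, basis=None, seed=0):
    """Estimate an orthonormal basis (n x rank) for the dominant row space of `a`.

    `basis` is a hypothetical warm-start argument: if the subspace changes
    slowly between training steps, a previous basis could be reused so that
    very few iterations are needed.
    """
    _, n = a.shape
    if basis is None:
        rng = np.random.default_rng(seed)
        basis = rng.standard_normal((n, rank))
    for _ in range(n_iter):
        # Power step: project through `a`, then orthonormalize with reduced QR.
        left, _ = np.linalg.qr(a @ basis)    # shape (m, rank)
        basis, _ = np.linalg.qr(a.T @ left)  # shape (n, rank)
    return basis


def compress(a, basis):
    """Coefficients of `a` in the subspace; this (m x rank) array is what gets stored."""
    return a @ basis


def reconstruct(coeffs, basis):
    """Low-rank approximation of the original matrix."""
    return coeffs @ basis.T


def stored_fraction(m, n, rank):
    """Fraction of the dense storage needed for the two rank-`rank` factors."""
    return rank * (m + n) / (m * n)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic matrix that is exactly rank 4, standing in for an activation map.
    a = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))
    q = subspace_iteration(a, rank=4)
    rel_err = np.linalg.norm(a - reconstruct(compress(a, q), q)) / np.linalg.norm(a)
    print("relative reconstruction error:", rel_err)
    print("stored fraction:", stored_fraction(64, 32, 4))
```

The achievable compression depends on how quickly the singular values of the real activations and weights decay, so the ratios reported in the abstract should not be inferred from this toy example.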
