Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning
Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane
TL;DR
This work tackles the heavy cost of fine-tuning Vision Transformers by introducing Adaptive Layer Selection Fine-Tuning (ALaST), which dynamically allocates a compute budget $b^i_l \in (0,1)$ to each layer $l$ at each training step $i$. Budgets are updated using the class-token delta $\Delta^i_l = (CLS^i_l - CLS^i_{l-1})^2$, so that layers contributing more to the final prediction receive more compute. Two orthogonal levers, token discarding and layer freezing, control where computation occurs: adaptive token selection retains the top $b \cdot N$ of a layer's $N$ tokens ranked by CLS attention, and adaptive layer freezing trains only the $K$ highest-budget layers. Empirically, ALaST reduces FLOPs by up to 2×, memory by up to 2×, and training time by up to 1.5× with accuracy comparable to full fine-tuning, and it complements PEFT methods like LoRA without adding new parameters. The method is validated on Flower-102, CIFAR-100, and Food-101 across ViT-B, DeiT-S, and DeiT-T, showing consistent efficiency gains and practical potential for on-device fine-tuning and edge deployments.
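The three ingredients named above (class-token deltas, top-$b \cdot N$ token selection, and top-$K$ layer selection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the squared difference is summed over the embedding dimension, and the rule that maps deltas to budgets (scale by the largest delta, then clip into $(0,1)$) is an assumed stand-in for the paper's actual update.

```python
def cls_deltas(cls_per_layer):
    """Squared change of the class token across each layer.

    cls_per_layer holds L+1 vectors: the class token entering the first
    layer, followed by the class token leaving each of the L layers.
    Assumption: the squared difference is summed over the embedding dim.
    """
    return [
        sum((c - p) ** 2 for c, p in zip(curr, prev))
        for prev, curr in zip(cls_per_layer, cls_per_layer[1:])
    ]


def update_budgets(deltas, eps=1e-3):
    """Map deltas to budgets in (0, 1); a larger delta gives a larger budget.

    Assumed rule (not from the paper): divide by the largest delta and clip.
    """
    top = max(deltas) or 1.0  # avoid dividing by zero when nothing changed
    return [min(1.0 - eps, max(eps, d / top)) for d in deltas]


def select_tokens(cls_attention, budget):
    """Indices of the top int(budget * N) tokens ranked by CLS attention."""
    n = len(cls_attention)
    k = max(1, int(budget * n))  # always keep at least one token
    ranked = sorted(range(n), key=lambda j: cls_attention[j], reverse=True)
    return sorted(ranked[:k])


def trainable_layers(budgets, k):
    """Indices of the k highest-budget layers; all others would be frozen."""
    ranked = sorted(range(len(budgets)), key=lambda l: budgets[l], reverse=True)
    return sorted(ranked[:k])


# Toy example: a 3-layer model with 2-dimensional class tokens.
cls_states = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.5], [3.0, 0.5]]
deltas = cls_deltas(cls_states)
budgets = update_budgets(deltas)
train_idx = trainable_layers(budgets, k=2)
kept = select_tokens([0.1, 0.4, 0.05, 0.3], budget=0.5)
```

In this toy run the third layer changes the class token the most, so it receives the largest budget and stays trainable, while the second layer would be frozen. In a real training loop these quantities would be recomputed for every mini-batch.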
Abstract
Recently, foundation models based on Vision Transformers (ViTs) have become widely available. However, their fine-tuning process is highly resource-intensive, which hinders their adoption in several edge or low-energy applications. To this end, in this paper we introduce an efficient fine-tuning method for ViTs called $\textbf{ALaST}$ ($\textit{Adaptive Layer Selection Fine-Tuning for Vision Transformers}$) to speed up the fine-tuning process while reducing computational cost, memory load, and training time. Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch. Therefore, at each fine-tuning step, we adaptively estimate the importance of all layers and assign what we call "compute budgets" accordingly. Layers that are allocated lower budgets are either trained with a reduced number of input tokens or kept frozen. Freezing a layer reduces the computational cost and memory usage by preventing updates to its weights, while discarding tokens removes redundant data, speeding up processing and reducing memory requirements. We show that this adaptive compute allocation enables a nearly optimal schedule for distributing computational resources across layers, resulting in substantial reductions in training time (up to 1.5x), FLOPs (up to 2x), and memory load (up to 2x) compared to traditional full fine-tuning approaches. Additionally, it can be successfully combined with other parameter-efficient fine-tuning methods, such as LoRA.
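To give a feel for why discarding tokens lowers compute, the sketch below uses the standard back-of-the-envelope count of multiply-accumulates (MACs) for one transformer layer's forward pass. It is an illustration only, not a measurement from the paper: the function names are hypothetical, the sizes (197 tokens, embedding dimension 768, MLP ratio 4) are assumed ViT-B/16-like values, and normalization, softmax, and bias terms are ignored.

```python
def layer_forward_macs(n_tokens, dim, mlp_ratio=4):
    """Approximate forward-pass MACs for one transformer encoder layer."""
    proj = 4 * n_tokens * dim * dim             # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim        # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # the two linear layers of the MLP
    return proj + attn + mlp


def kept_tokens(n_tokens, budget):
    """Number of tokens a layer processes under a budget in (0, 1)."""
    return max(1, int(budget * n_tokens))


# Assumed ViT-B/16-like sizes: 196 patch tokens plus the class token.
full = layer_forward_macs(197, 768)
half = layer_forward_macs(kept_tokens(197, 0.5), 768)
ratio = half / full
```

With these assumed sizes, a budget of 0.5 brings the layer's forward cost to a little under half of the full cost, because the projection and MLP terms scale linearly with the token count and the attention term scales quadratically. Savings from freezing come on top of this, since a frozen layer needs no weight gradients.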
