Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning

Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane

TL;DR

This work tackles the heavy cost of fine-tuning Vision Transformers by introducing Adaptive Layer Selection Fine-Tuning (ALaST), which dynamically allocates a per-layer compute budget $b^i_l \in (0,1)$ at each training step. Budgets are updated using the class-token delta $\Delta^i_l = (CLS^i_l - CLS^i_{l-1})^2$, so that layers contributing more to the final prediction receive more compute. Two orthogonal levers, token discarding and layer freezing, control where computation occurs: adaptive token selection retains the top $b \cdot N$ tokens based on CLS attention, and adaptive layer freezing trains only the $K$ highest-budget layers. Empirically, ALaST reduces FLOPs by up to 2×, memory by up to 2×, and training time by up to 1.5× with accuracy comparable to full fine-tuning, and it complements PEFT methods like LoRA without adding new parameters. The method is validated on Flower-102, CIFAR-100, and Food-101 across ViT-B, DeiT-S, and DeiT-T, showing consistent efficiency gains and practical potential for on-device fine-tuning and edge deployments.
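The class-token delta above can be illustrated with a small pure-Python sketch. Only the squared difference between consecutive CLS embeddings comes from the summary. Summing it over the embedding dimension, and the helper `budgets_from_deltas` (which scales by the largest delta and clips into $(0,1)$), are assumptions made for illustration and are not the paper's actual update rule.

```python
def cls_delta(cls_prev, cls_curr):
    """Squared change of the CLS embedding across one layer.

    Assumption: the element-wise squares are summed into a single scalar.
    """
    return sum((c - p) ** 2 for p, c in zip(cls_prev, cls_curr))


def budgets_from_deltas(deltas, eps=0.05):
    """Hypothetical mapping from deltas to budgets in the open interval (0, 1).

    Each delta is divided by the largest one and clipped to [eps, 1 - eps].
    """
    top = max(deltas)
    if top == 0.0:
        return [eps] * len(deltas)
    return [min(1.0 - eps, max(eps, d / top)) for d in deltas]


# Toy CLS embeddings: the input, then the output of each of three layers.
cls_states = [
    [0.0, 0.0],
    [1.0, 1.0],  # layer 1 changes the CLS token a lot
    [1.0, 1.5],  # layer 2 barely changes it
    [2.0, 1.5],  # layer 3 changes it moderately
]

deltas = [
    cls_delta(cls_states[l - 1], cls_states[l])
    for l in range(1, len(cls_states))
]
budgets = budgets_from_deltas(deltas)
print(deltas)
print(budgets)
```

In this toy run the second layer, which barely moves the CLS token, ends up with the smallest budget, matching the intuition that layers contributing less to the prediction should receive less compute.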

Abstract

Recently, foundation models based on Vision Transformers (ViTs) have become widely available. However, their fine-tuning process is highly resource-intensive, and it hinders their adoption in several edge or low-energy applications. To this end, in this paper we introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers) to speed up the fine-tuning process while reducing computational cost, memory load, and training time. Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch. Therefore, at each fine-tuning step, we adaptively estimate the importance of all layers and we assign what we call "compute budgets" accordingly. Layers that were allocated lower budgets are either trained with a reduced number of input tokens or kept frozen. Freezing a layer reduces the computational cost and memory usage by preventing updates to its weights, while discarding tokens removes redundant data, speeding up processing and reducing memory requirements. We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources across layers, resulting in substantial reductions in training time (up to 1.5×), FLOPs (up to 2×), and memory load (up to 2×) compared to traditional full fine-tuning approaches. Additionally, it can be successfully combined with other parameter-efficient fine-tuning methods, such as LoRA.
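The two ways a budget is spent, fewer input tokens or a frozen layer, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names `select_tokens` and `trainable_layers`, the rounding of the token count, the minimum of one kept token, and the toy attention scores and budgets are all hypothetical.

```python
def select_tokens(cls_attention, budget):
    """Indices of the patch tokens kept for a layer with the given budget.

    Keeps the round(budget * N) tokens with the highest CLS attention
    (at least one), returned in their original order.
    """
    n = len(cls_attention)
    keep = max(1, round(budget * n))
    ranked = sorted(range(n), key=lambda i: cls_attention[i], reverse=True)
    return sorted(ranked[:keep])


def trainable_layers(budgets, k):
    """Indices of the k highest-budget layers; all other layers stay frozen."""
    ranked = sorted(range(len(budgets)), key=lambda l: budgets[l], reverse=True)
    return set(ranked[:k])


# Toy CLS attention over six patch tokens and a budget of 0.5 for this layer.
cls_attention = [0.05, 0.30, 0.10, 0.25, 0.02, 0.28]
kept = select_tokens(cls_attention, budget=0.5)

# Toy per-layer budgets; only the two highest-budget layers are trained.
layer_budgets = [0.95, 0.125, 0.5]
trainable = trainable_layers(layer_budgets, k=2)

print(kept)       # token indices that flow through the layer
print(trainable)  # layer indices whose weights are updated
```

In a real training loop, the frozen layers would simply be excluded from gradient computation for that step, which is where the memory and FLOPs savings described above would come from.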

Paper Structure (22 sections, 6 equations, 10 figures, 7 tables)

This paper contains 22 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: At each fine-tuning step, we assign what we call "compute budgets" to transformer layers. The budget determines the computational resources we invest in each layer, i.e., (a) whether the layer is frozen or trainable and (b) how many tokens that layer can process. By adaptively allocating the budget, we make the fine-tuning faster and more efficient in terms of FLOPs, memory, and time.
  • Figure 2: Relative magnitude $\frac{|f(x)|}{|f(x)+x|}$ for each transformer layer in pre-trained DeiT-S (left) and ViT-B (right). Layers with low relative magnitudes (final ones for DeiT-S and middle ones for ViT-B) provide a minimal contribution to the residual token stream, working as identity functions.
  • Figure 3: At each fine-tuning iteration, each layer is assigned a compute budget. Based on the budget, we allow more (high budget) or fewer (low budget) tokens to flow through the layer; greyed-out tokens are excluded from computation. Additionally, we freeze layers with the lowest budgets to save computing resources and memory.
  • Figure 4: Attention of the CLS token to different patches at layers 2, 4, and 6 of DeiT-S (Touvron et al.). Brighter patches receive higher attention. The CLS token's attention captures semantically important patches.
  • Figure 5: Normalized FLOPs, memory, wall-clock time, and accuracy when fine-tuning with ALaST, relative to full fine-tuning. On average, we achieve similar accuracy with $60\%$ of the FLOPs, $50\%$ of the memory, and $80\%$ of the time. Results are averaged over all the considered datasets for DeiT-S.
  • ...and 5 more figures
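
The relative magnitude $\frac{|f(x)|}{|f(x)+x|}$ from the Figure 2 caption can be computed as in the sketch below. Reading $|\cdot|$ as the L2 norm of a single token vector is an assumption, as are the function names and the toy vectors.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def relative_magnitude(layer_update, x):
    """|f(x)| / |f(x) + x| for one token.

    `layer_update` is f(x), the output of the layer's residual branch, and
    `x` is the token entering the layer. Values near zero mean the layer
    acts almost like an identity function on the residual stream.
    """
    out = [f + xi for f, xi in zip(layer_update, x)]
    return l2_norm(layer_update) / l2_norm(out)


x = [3.0, 4.0]
identity_like = relative_magnitude([0.0, 0.0], x)  # layer adds nothing
strong_update = relative_magnitude([3.0, 4.0], x)  # layer doubles the token

print(identity_like)
print(strong_update)
```

A layer whose update is zero has relative magnitude 0, while one that adds a vector equal to its input has relative magnitude 0.5.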