AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs

Anshul Kumar, Gagan Raj Gupta, Manisha Chawla

TL;DR

AdaGradSelect tackles the high cost of fine-tuning by adaptively selecting which transformer blocks to update based on gradient history and past update frequencies. By combining Dirichlet-based sampling with epsilon-greedy exploration, it concentrates training on a small, high-impact subset of blocks, cutting training time and memory use while keeping performance close to full fine-tuning. Across three small-model families and two reasoning benchmarks, it outperforms LoRA on GSM8K and matches its accuracy on MATH, while training about 12% faster and using about 35% less GPU memory. The approach offers a practical alternative for efficient fine-tuning in resource-constrained environments, particularly for Small Language Models.

Abstract

Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict the training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It uses a combination of Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, and an epsilon-greedy exploration strategy. This lets the method explore different blocks in early training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.
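The selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`epsilon_at`, `select_blocks`), the parameters `eps_start` and `eps_end`, the add-one smoothing of the update counts, and the choice of which branch uses gradient norms and which uses Dirichlet sampling are all assumptions made for illustration.

```python
import numpy as np

def epsilon_at(step, total_steps, eps_start=1.0, eps_end=1e-3):
    """Exponentially decaying epsilon (hypothetical schedule).

    Returns eps_start at step 0 and eps_end at the last step.
    """
    if total_steps <= 1:
        return eps_end
    return eps_start * (eps_end / eps_start) ** (step / (total_steps - 1))

def select_blocks(grad_norms, update_counts, k, eps, rng):
    """Pick k block indices to train this step and record the choice.

    Assumption of this sketch: the exploration branch ranks blocks by
    their current gradient norm, and the exploitation branch samples
    from a Dirichlet whose concentration is the past update counts.
    """
    n = len(grad_norms)
    if rng.random() < eps:
        # Exploration (assumed): take the k blocks with the largest norms.
        chosen = np.argsort(grad_norms)[-k:]
    else:
        # Exploitation (assumed): blocks updated more often in the past
        # get more probability mass; +1.0 keeps every concentration > 0.
        probs = rng.dirichlet(update_counts + 1.0)
        chosen = rng.choice(n, size=k, replace=False, p=probs)
    update_counts[chosen] += 1  # update-frequency history
    return np.sort(chosen)

rng = np.random.default_rng(0)
counts = np.zeros(8)  # one counter per transformer block
norms = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.5])
first = select_blocks(norms, counts, k=3, eps=epsilon_at(0, 100), rng=rng)
```

With `eps_start=1.0` the first step always takes the exploration branch. Unlike a schedule that reaches pure exploitation at the last step, this one only makes epsilon small there, so a hard switch to exploitation after the first epoch would need to be added separately.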

Paper Structure

This paper contains 20 sections, 10 equations, 4 figures, 1 table, 2 algorithms.

Figures (4)

  • Figure 1: Comparison of training time vs. average GPU usage when training Qwen2.5 0.5B with different methods
  • Figure 2: Illustration of AdaGradSelect’s selective block update strategy. During fine-tuning, only a subset of transformer blocks (green) is updated, while the others remain frozen (red). In the first epoch, an $\epsilon$-greedy strategy balances exploration and exploitation as shown, with $\epsilon$ decaying exponentially. The first step is always exploration and the $N$th step is always exploitation. After Epoch 1, the method transitions fully to exploitation.
  • Figure 3: Comparison of Accuracy vs Percentage of Qwen2.5 0.5B Transformer Blocks Selected
  • Figure 4: Loss convergence of Qwen2.5 0.5B on MetaMath40K for AdaGradSelect (10-30%) and other methods