Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh

TL;DR

The paper investigates whether advanced LoRA-based PEFT variants truly outperform vanilla LoRA, or whether their reported gains stem from comparisons made under insufficiently tuned hyperparameters. Using thorough learning-rate sweeps and Hessian-based analyses across multiple decoder-only LLMs and tasks, the study shows that once the learning rate is properly tuned, all variants reach near-identical peak performance, with only modest rank-dependent differences. A second-order analysis finds that optimal learning rates correlate with the largest Hessian eigenvalue $\lambda_{\max}$, which explains why different initialization strategies require different learning rates $\eta$. The findings argue for rigorous hyperparameter exploration when evaluating PEFT methods and suggest that vanilla LoRA remains a competitive baseline, while offering guidance on when specific variants may yield marginal benefits in particular rank regimes. Overall, the work emphasizes that performance gains attributed to LoRA variants may reflect training configurations more than fundamental methodological advantages, which should guide more reliable future comparisons of PEFT methods for LLM fine-tuning.

Abstract

Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies and architectural modifications, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate four representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches. Across mathematical and code generation tasks on diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
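
The link between the largest Hessian eigenvalue and the usable learning rate can be illustrated on a toy problem. The sketch below is not the paper's procedure: it assumes a two-dimensional quadratic loss $\tfrac{1}{2} x^\top H x$ with a hand-picked matrix `H`, estimates $\lambda_{\max}$ by power iteration, and checks the classical result that plain gradient descent on a quadratic converges for $\eta < 2/\lambda_{\max}$ and diverges above it. All function names and the matrix values are hypothetical.

```python
import math

def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def largest_eigenvalue(hessian, iters=200):
    """Estimate lambda_max of a symmetric positive-definite matrix by power iteration."""
    v = [1.0] * len(hessian)
    lam = 0.0
    for _ in range(iters):
        w = mat_vec(hessian, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # Rayleigh quotient of the normalized vector.
        lam = sum(a * b for a, b in zip(v, mat_vec(hessian, v)))
    return lam

def gradient_descent(hessian, lr, steps=100):
    """Run gradient descent on 0.5 * x^T H x and return the final loss."""
    x = [1.0] * len(hessian)
    for _ in range(steps):
        grad = mat_vec(hessian, x)  # gradient of the quadratic is H x
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return 0.5 * sum(a * b for a, b in zip(x, mat_vec(hessian, x)))

# Toy Hessian (assumed values, chosen only for illustration).
H = [[4.0, 1.0], [1.0, 3.0]]
lam_max = largest_eigenvalue(H)
threshold = 2.0 / lam_max  # classical stability limit for a quadratic

stable = gradient_descent(H, 0.9 * threshold)    # just below the limit
unstable = gradient_descent(H, 1.1 * threshold)  # just above the limit
print(f"lambda_max={lam_max:.4f}, threshold={threshold:.4f}")
print(f"loss below threshold: {stable:.3e}, above threshold: {unstable:.3e}")
```

In this toy setting, a larger $\lambda_{\max}$ shrinks the range of stable learning rates. That is the intuition behind attributing method-specific learning rate ranges to differences in the largest Hessian eigenvalue.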

Paper Structure (50 sections, 10 equations, 17 figures, 12 tables, 2 algorithms)

Figures (17)

  • Figure 1: Performance of Qwen3-0.6B fine-tuned on mathematical reasoning tasks across learning rates. Different methods reach a similar performance level once the learning rate is properly tuned.
  • Figure 2: Frequency of advanced LoRA-based PEFT studies, categorized by whether learning rate or batch size tuning was applied and whether comparisons with vanilla LoRA across different ranks were conducted. Crucially, only one of the surveyed works simultaneously covered all three dimensions. Refer to the appendix for detailed data counts.
  • Figure 3: Overview of our considered PEFT methods: (a) vanilla LoRA, (b) three representative initialization variants, and (c) one architectural modification.
  • Figure 4: Performance of Llama-2-7B on mathematical reasoning and code generation tasks across varying learning rates ($r=128$, $B=128$). Notably, PiSSA peaks at lower learning rates but remains effective at larger learning rates on both tasks (e.g., $1.1 \times 10^{-3}$), where other methods diverge.
  • Figure 5: Best achievable performance of LoRA and its advanced variants across adapter ranks on Gemma-3-1B ($B=64$). With properly tuned learning rates, all methods exhibit similar performance improvement trends as the rank increases, though subtle rank-dependent behaviors emerge. Results are reported with means and standard deviations over three independent runs.
  • ...and 12 more figures
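
For readers unfamiliar with the baseline in Figure 3(a), the following is a minimal sketch of the vanilla LoRA update in its commonly described form: a frozen weight $W_0$ plus a scaled low-rank product $\frac{\alpha}{r} B A$, with $B$ initialized to zero so that the adapted weight equals $W_0$ at the start of training. The helper names, the Gaussian standard deviation of 0.02, and the tiny dimensions are assumptions for illustration, not details taken from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def init_lora(d_out, d_in, rank, seed=0):
    """Vanilla LoRA init: A is small random Gaussian, B is all zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]  # std is an assumption
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def lora_weight(W0, A, B, alpha):
    """Return the effective weight W0 + (alpha / r) * B @ A."""
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(row_w, row_d)]
            for row_w, row_d in zip(W0, delta)]

# Hypothetical 2x3 frozen weight with a rank-2 adapter.
W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A, B = init_lora(d_out=2, d_in=3, rank=2)
W_start = lora_weight(W0, A, B, alpha=16)

# Only A and B are trained: rank * (d_in + d_out) parameters.
trainable = len(A) * len(A[0]) + len(B) * len(B[0])
print(W_start == W0, trainable)
```

Because $B$ starts at zero, the first forward pass reproduces the pretrained model. Initialization variants change how $A$ and $B$ are set at the start, which is one reason they can favor different learning rate ranges.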