Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning

Nan Chen, Soledad Villar, Soufiane Hayou

TL;DR

This work addresses the brittle learning-rate tuning problem in LoRA-based parameter-efficient finetuning by introducing Maximal-Update Adaptation ($\mu$A), which derives how the optimal learning rate $\eta$ should scale with model width $n$ and LoRA rank $r$ to ensure stable yet non-vanishing feature updates. By analyzing the joint limit $n,r\to\infty$ under Init[A] and Init[B], the authors identify two regimes: one where $\eta$ scales inversely with rank and another where $\eta$ is largely rank-invariant. They also identify a configuration whose learning-rate scaling matches that of full finetuning (FFT), which enables learning-rate transfer from LoRA to FFT. The theory is validated through experiments across supervised finetuning, reinforcement learning with verifiable rewards, vision-language tasks, and diffusion models, showing that the predicted scaling rules hold broadly and that LoRA-tuned learning rates can transfer to FFT in practice. The practical impact is a principled workflow for selecting learning rates using LoRA configurations, which reduces hyperparameter tuning costs and enables hyperparameter transfer from LoRA to FFT. Overall, $\mu$A provides a unified understanding of LoRA dynamics and offers concrete, transfer-friendly guidelines for scalable finetuning of large models.

Abstract

Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($\mu$A), a theoretical framework that characterizes how the "optimal" learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $\mu$A is inspired by the Maximal-Update Parametrization ($\mu$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on the initialization and the LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision-language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
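
As a rough illustration of the two regimes named in the abstract, the sketch below rescales a learning rate tuned at one rank to another rank. The function name `rescale_lr` and the regime labels are hypothetical, introduced only for this example. Which LoRA configuration (initialization and scaling factor) falls into which regime is determined by the paper's theorems and is not encoded here, so the caller must supply the regime. The rules are asymptotic, so the result is best read as a starting point for a small local sweep rather than an exact optimum.

```python
def rescale_lr(eta_ref: float, r_ref: int, r_new: int, regime: str) -> float:
    """Rescale a learning rate tuned at rank r_ref to rank r_new.

    regime is a hypothetical label for the two scaling patterns:
      "rank_invariant": the optimal learning rate stays roughly the same.
      "inverse_rank":   the optimal learning rate scales like 1 / rank.
    """
    if r_ref <= 0 or r_new <= 0:
        raise ValueError("ranks must be positive integers")
    if regime == "rank_invariant":
        return eta_ref
    if regime == "inverse_rank":
        # eta * r is held constant, so eta_new = eta_ref * r_ref / r_new.
        return eta_ref * r_ref / r_new
    raise ValueError(f"unknown regime: {regime!r}")


if __name__ == "__main__":
    eta_at_rank_8 = 2.0 ** -10  # illustrative value, not taken from the paper
    for rank in (8, 16, 64):
        print(rank, rescale_lr(eta_at_rank_8, 8, rank, "inverse_rank"))
```

In the inverse-rank regime, doubling the rank halves the suggested learning rate. In the rank-invariant regime, the value tuned at one rank is reused unchanged.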

Paper Structure (76 sections, 18 theorems, 103 equations, 17 figures, 8 tables)

Key Result

Lemma 3.6

Fix $t\le T$. If $Z_B^0 = \mathcal{O}(1)$ and $\Delta Z_B^s = \mathcal{O}(1)$ for all $s\le t$, then $Z_B^t = \mathcal{O}(1)$.
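
The name "telescoping bound" suggests the following one-line argument. This is a sketch under two assumptions that are not stated in this summary: that $\Delta Z_B^s$ denotes the per-step increment $Z_B^s - Z_B^{s-1}$, and that the horizon $T$ is fixed, so it does not grow with $n$ or $r$.

```latex
\begin{align*}
Z_B^t &= Z_B^0 + \sum_{s=1}^{t} \left( Z_B^s - Z_B^{s-1} \right)
       = Z_B^0 + \sum_{s=1}^{t} \Delta Z_B^s, \\
\left| Z_B^t \right| &\le \left| Z_B^0 \right| + \sum_{s=1}^{t} \left| \Delta Z_B^s \right|
       = \mathcal{O}(1) + t \cdot \mathcal{O}(1)
       = \mathcal{O}(1) \qquad \text{since } t \le T .
\end{align*}
```

Under these assumptions, a finite sum of $\mathcal{O}(1)$ terms stays $\mathcal{O}(1)$, which is why controlling the initial value and each increment is enough to control $Z_B^t$.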

Figures (17)

  • Figure 1: LoRA finetuning with standard design choices.
  • Figure 2: Learning rate sweeps for three standard LoRA configurations on Llama-3.2-1B (Tulu3). Each panel plots final training loss (EMA-smoothed) versus log-scale learning rate $\log_2(\eta)$ for multiple ranks (colored) and FFT (red). Large markers indicate the optimal learning rate for each rank; the dashed red line marks the optimal FFT learning rate. The y-axis is clipped for readability.
  • Figure 3: Learning rate sweeps for Init[A] with $\alpha=1$ across four models. Curves show final training loss (EMA-smoothed) versus log-scale learning rate ($\log_2(\eta)$). Large markers denote per-rank optima. The y-axis is clipped for readability.
  • Figure 4: Learning rate sweeps for Init[A] with rank-dependent $\alpha=r^{-1}$ across four models. Curves show final training loss (EMA-smoothed) versus log-scale learning rate ($\log_2(\eta)$). Large markers denote per-rank optima. The y-axis is clipped for readability.
  • Figure 5: Learning rate sweeps for Init[B] with $\alpha=1$ across four models. Curves show final training loss (EMA-smoothed) versus log-scale learning rate ($\log_2(\eta)$) for multiple ranks (green) and FFT (red). Large markers denote per-rank optima. The y-axis is clipped for readability. The red vertical dashed line marks the optimal FFT learning rate.
  • ...and 12 more figures
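
The captions above describe sweeps that pick, for each rank, the learning rate with the lowest EMA-smoothed final training loss on a $\log_2(\eta)$ grid. A minimal sketch of that selection step follows. The helper names `ema` and `best_log2_lr`, the smoothing factor `beta=0.9`, the bias correction, and the handling of diverged runs are all assumptions made for illustration; the paper's exact smoothing procedure is not given in this summary.

```python
import math


def ema(values, beta=0.9):
    """Bias-corrected exponential moving average of a loss curve.

    beta is an assumed smoothing factor, not a value from the paper.
    """
    smoothed, avg = [], 0.0
    for step, value in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * value
        smoothed.append(avg / (1.0 - beta ** step))
    return smoothed


def best_log2_lr(loss_curves, beta=0.9):
    """Return the log2 learning rate with the lowest final smoothed loss.

    loss_curves maps log2(eta) to a list of per-step training losses
    for one rank. Runs whose smoothed final loss is not finite are
    treated as diverged and skipped.
    """
    finals = {}
    for log2_eta, losses in loss_curves.items():
        final = ema(losses, beta)[-1]
        if math.isfinite(final):
            finals[log2_eta] = final
    return min(finals, key=finals.get)


if __name__ == "__main__":
    # Toy curves for a single rank; the numbers are made up.
    curves = {
        -12: [2.0, 1.8, 1.7],
        -10: [2.0, 1.5, 1.2],
        -8: [2.0, 3.0, float("inf")],  # diverged run
    }
    print(best_log2_lr(curves))
```

Repeating this selection for each rank gives the per-rank optima that the large markers in the figures denote.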

Theorems & Definitions (45)

  • Definition 3.1: Low-Rank Adaptation (LoRA)
  • Definition 3.2: Initialization schemes for LoRA
  • Definition 3.3: LoRA features
  • Definition 3.4: Asymptotic notation
  • Definition 3.5: Feature stability
  • Lemma 3.6: Telescoping bound
  • Definition 3.7: Stable feature learning with LoRA
  • Theorem 4.3: Init[A]
  • Corollary 4.4: $\mu$A scaling rules for Init[A]
  • Theorem 4.5: Init[B]
  • ...and 35 more