HRP: High-Rank Preheating for Superior LoRA Initialization

Yuzhu Chen, Yingjie Wang, Shi Fu, Li Shen, Yongcheng Jing, Xinmei Tian, Dacheng Tao

TL;DR

This work shows that LoRA fine-tuning is highly sensitive to initialization and that random schemes can prevent training from reaching the best low-rank approximation of the target change $M=W^{\text{target}}-W^{\text{init}}$. By analyzing gradient flow for Asymmetric and Classic LoRA, the authors prove that a well-chosen ("wise") initialization yields exponential convergence to the optimal rank-$r$ solution, while random initialization can trap training in suboptimal regions. They propose High-Rank Preheating (HRP), which runs a few steps of high-rank LoRA to approximate the main singular directions of $M$ through the product $BA^\top$, then uses its leading singular vectors to initialize the main fine-tuning run; theoretical bounds show that HRP improves the expected loss, especially when the target has low effective rank. Empirically, HRP outperforms other initialization strategies on NLU and NLG tasks and achieves results comparable to full-parameter fine-tuning with negligible extra memory, supporting its practicality for resource-constrained fine-tuning of large models.
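
For reference, the "best low-rank approximation" of $M$ can be made concrete with the Eckart–Young–Mirsky theorem. The statement below assumes the error is measured in squared Frobenius norm; the paper's exact loss normalization is not shown in this summary, so constant factors may differ. With singular values $\sigma_1\ge\sigma_2\ge\dots\ge 0$ and singular vectors $u_i$, $v_i$:

$$
M=\sum_{i}\sigma_i u_i v_i^\top,
\qquad
\min_{\operatorname{rank}(X)\le r}\|X-M\|_F^2=\sum_{i>r}\sigma_i^2,
\qquad
\text{attained at } M_r=\sum_{i\le r}\sigma_i u_i v_i^\top .
$$

A rank-$r$ adapter $BA^\top$ can therefore do no better than $M_r$, and reaching it requires the adapter's row and column spaces to align with the leading singular vectors of $M$.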

Abstract

This paper studies the crucial impact of initialization in Low-Rank Adaptation (LoRA). Through theoretical analysis, we demonstrate that the fine-tuned result of LoRA is highly sensitive to initialization, which is likely to lead to suboptimal low-rank results. This issue can be mitigated by adjusting the initial direction towards the main singular vectors of the target $\Delta W$, which is, however, typically unknown in real-world scenarios. To approximate this initial direction, we propose High-Rank Preheating (HRP), which first trains LoRA with a higher preheating rank for a few steps, then uses the main singular vectors of the derived $BA^\top$ as initialization for the main fine-tuning process. With only a modification in the initial direction, we prove that HRP makes LoRA achieve better fine-tuned results than random initialization in expectation, and the enhancement grows with the preheating rank. We validate our theoretical findings through extensive experiments in various models and tasks, where HRP significantly enhances LoRA's effectiveness and outperforms other initialization strategies and other LoRA variants.
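
The preheat-then-SVD recipe can be sketched on a toy matrix-factorization problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the loss is assumed to be $\tfrac12\|BA^\top-M\|_F^2$, both phases use Asymmetric LoRA (only $B$ is trained, $A$ is a frozen orthonormal frame), $B$ starts at zero, and the helper names `train_B` and `hrp_init` are hypothetical. The target matches the one in the Figure 1 caption.

```python
import numpy as np


def loss(B, A, M):
    """Assumed objective: 0.5 * ||B A^T - M||_F^2."""
    return 0.5 * np.linalg.norm(B @ A.T - M) ** 2


def train_B(A, M, steps, lr=0.5):
    """Gradient descent on B only, with A frozen (Asymmetric LoRA, assumed)."""
    B = np.zeros((M.shape[0], A.shape[1]))  # B = 0, so the initial update is zero
    for _ in range(steps):
        B -= lr * (B @ A.T - M) @ A  # gradient of the assumed loss w.r.t. B
    return B


def hrp_init(M, A_high, r, preheat_steps=3):
    """Preheat at high rank, then keep the top-r right singular vectors of B A^T."""
    B_high = train_B(A_high, M, preheat_steps)
    _, _, Vt = np.linalg.svd(B_high @ A_high.T, full_matrices=False)
    return Vt[:r].T  # shape (n, r), orthonormal columns


rng = np.random.default_rng(0)
n, r, hrp_rank = 32, 2, 6
M = np.diag(np.r_[np.ones(12), np.zeros(20)])  # toy target: 12 ones, 20 zeros

# Random orthonormal frame of width hrp_rank; its first r columns are themselves
# a random orthonormal rank-r initialization and serve as the baseline.
Q, _ = np.linalg.qr(rng.standard_normal((n, hrp_rank)))
A_rand = Q[:, :r]
A_hrp = hrp_init(M, Q, r)

loss_rand = loss(train_B(A_rand, M, steps=60), A_rand, M)
loss_hrp = loss(train_B(A_hrp, M, steps=60), A_hrp, M)
print(f"random init: {loss_rand:.4f}   HRP-style init: {loss_hrp:.4f}")
```

In this simplified setting the preheated product $BA^\top$ is a scalar multiple of $M$ projected onto the span of the high-rank frame, so its top-$r$ right singular vectors are the best $r$ directions available inside that span. A wider preheating frame gives the SVD more directions to choose from, which is one way to read the claim that the enhancement grows with the preheating rank.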

Paper Structure

This paper contains 28 sections, 16 theorems, 82 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Consider Asymmetric LoRA under the matrix-factorization objective, trained by gradient flow, with the frozen adapter drawn from a Gaussian or an orthogonal initialization. For both LSI and RSI, the following bound holds for all $t>0$: [displayed inequality omitted], where $\mathbb{E}$ denotes the expectation with respect to the randomness in initialization. The inequality becomes an equality as $t\to\infty$.

Figures (2)

  • Figure 1: Loss curves for matrix factorization targeting $M=\operatorname{diag}(I_{12},O_{20\times 20})$ with $r=2$. Left: Classic LoRA with Gaussian initialization, orthogonal initialization, HRP-derived initialization with $\operatorname{hrp\_rank}=6$, and target initialization (as suggested by the theorem labeled asym-wise). Right: Asymmetric LoRA with the same initialization strategies.
  • Figure 2: Loss curves for fine-tuning meta-llama/Llama-3.2-1B-Instruct on MetaMathQA.
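
To read the Figure 1 curves it helps to know the lowest error any rank-2 adapter can reach on that target. The short check below builds the target from the caption and computes the truncated-SVD residual; it assumes the error is the squared Frobenius norm, so the plotted loss may differ by a constant factor such as $\tfrac12$.

```python
import numpy as np

# Target from the Figure 1 caption: 12 unit singular values followed by 20 zeros.
M = np.diag(np.concatenate([np.ones(12), np.zeros(20)]))
r = 2

U, S, Vt = np.linalg.svd(M)
M_r = (U[:, :r] * S[:r]) @ Vt[:r]        # truncated SVD: a best rank-r approximation
residual = np.linalg.norm(M - M_r) ** 2  # squared Frobenius error of that approximation

print(np.linalg.matrix_rank(M), residual)
```

The target has rank 12, so a rank-2 adapter can capture only two of the twelve unit singular values; the residual is about 10 in squared Frobenius norm, and no initialization can push a rank-2 solution below that floor.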

Theorems & Definitions (26)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 16 more