LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang

TL;DR

LoRA-DA derives a data-aware initialization for LoRA via asymptotic analysis. The method splits the fine-tuning error into a variance term driven by the Fisher information and a bias term captured by a Fisher-gradient formulation, and solves a quadratic program to obtain the initialization subspace for A and B. It operationalizes this with a small set of target-domain samples, using K-FAC to estimate the Fisher factors and LOBPCG to compute eigenvectors. Empirically, LoRA-DA improves final accuracy on NLU and NLG benchmarks, accelerates convergence, and keeps initialization overhead modest compared with gradient-based baselines.
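To make the K-FAC step more concrete, here is a minimal sketch of how Kronecker factors are commonly estimated for a single linear layer. It is an illustration under stated assumptions, not the paper's implementation: the function name `kfac_factors`, the array shapes, and the random stand-in data are all hypothetical, and in practice the inputs would be layer activations and output-side gradients collected on the target-domain samples.

```python
import numpy as np

def kfac_factors(acts, grads):
    """Estimate K-FAC factors for one linear layer (illustrative sketch).

    acts:  (n, d_in)  layer inputs for n samples
    grads: (n, d_out) per-sample gradients w.r.t. the layer outputs
    Returns the input-side and output-side second-moment matrices.
    """
    n = acts.shape[0]
    act_factor = acts.T @ acts / n      # (d_in, d_in)
    grad_factor = grads.T @ grads / n   # (d_out, d_out)
    return act_factor, grad_factor

# Random arrays stand in for real activations and gradients (assumption).
rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 8))
grads = rng.standard_normal((64, 4))
act_factor, grad_factor = kfac_factors(acts, grads)

# K-FAC approximates the layer's Fisher matrix by a Kronecker product of the
# two factors; the ordering depends on the vectorization convention used.
fisher_approx = np.kron(act_factor, grad_factor)  # (32, 32)
```

Storing the two small factors instead of the full Fisher matrix is the point of the approximation: here that is 8×8 plus 4×4 entries rather than 32×32.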

Abstract

With the widespread adoption of LLMs, LoRA has become a dominant method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition, which remains unsatisfactory due to the weak empirical performance of the one-step fine-tuning model that serves as their basis, as well as the fact that these methods either lack a rigorous theoretical foundation or depend heavily on restrictive isotropic assumptions. In this paper, we establish a theoretical framework for data-aware LoRA initialization based on asymptotic analysis. Starting from a general optimization objective that minimizes the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. By solving this problem, we obtain an optimal initialization strategy for LoRA. Building on this theoretical framework, we develop an efficient algorithm, LoRA-DA, which estimates the terms in the optimization problem from a small set of target domain samples and obtains the optimal LoRA initialization. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.

Paper Structure

This paper contains 33 sections, 4 theorems, 47 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

(proved in Appendix appendix:one_dimensional) In the case where the output dimension $d_{2}=1$, the optimal initialization of the matrix $\bm{A}$ is given by and from Section Eigenvalue_theorem we know that the $r$ column vectors of $\bm{A}_0^*$ correspond to the eigenvectors of $\bm{\Omega}$ associated with its $r$ smallest eigenvalues, where $\bm{\Omega}$ is the Initialization Guidance Matrix g
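The eigenvector selection described in Theorem 1 can be sketched with SciPy's LOBPCG solver. This is a hedged illustration only: the definition of $\bm{\Omega}$ is not reproduced in this excerpt, so the matrix `omega` below is a synthetic symmetric stand-in with a known spectrum, and the helper name `smallest_eigvecs` is hypothetical.

```python
import numpy as np
from scipy.sparse.linalg import lobpcg

def smallest_eigvecs(omega, r, seed=0):
    """Return the r smallest eigenvalues of a symmetric matrix and their eigenvectors."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal((omega.shape[0], r))  # random initial block
    vals, vecs = lobpcg(omega, x0, largest=False, tol=1e-6, maxiter=200)
    order = np.argsort(vals)                       # enforce ascending order
    return vals[order], vecs[:, order]

# Synthetic stand-in for the guidance matrix (assumption): a symmetric matrix
# whose four smallest eigenvalues are 0.1, 0.2, 0.3, 0.4 by construction.
rng = np.random.default_rng(1)
n, r = 60, 4
q, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = np.concatenate([[0.1, 0.2, 0.3, 0.4], np.linspace(5.0, 10.0, n - r)])
omega = (q * spectrum) @ q.T
omega = (omega + omega.T) / 2                      # remove round-off asymmetry

# Columns of a_init span the eigenspace of the r smallest eigenvalues.
vals, a_init = smallest_eigvecs(omega, r)
```

An iterative block solver such as LOBPCG only needs matrix-vector products and returns just the `r` requested eigenpairs, which is why it suits cases where a full eigendecomposition would be wasteful.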

Figures (4)

  • Figure 1: The yellow circle illustrates the estimation variance induced by the stochasticity of training samples in the unconstrained setting. The red variance term represents its projection onto the LoRA subspace under the fixed-$\bm{A}$ constraint, while the red bias term corresponds to the approximation error due to the distance between $\bm{W}_{\text{tgt}}$ and the LoRA subspace.
  • Figure 2: The loss, gradient norm, and evaluation accuracy on GSM8K over the training steps for LoRA (yellow), LoRA-One (red), and LoRA-DA (blue).
  • Figure 3: Loss landscape.
  • Figure 4: Accuracy of LoRA-DA across different ranks on the GSM8K task.

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Remark 3
  • Lemma 4
  • proof
  • Lemma 5
  • proof