The Primacy of Magnitude in Low-Rank Adaptation

Zicheng Zhang, Haoran Li, Yifeng Zhang, Guoqiang Gong, Jiaxing Wang, Junxing Hu, Pengzhang Liu, Qixia Jiang

TL;DR

This work reframes LoRA training dynamics around the magnitude of weight updates, showing that this magnitude governs convergence and expressiveness. It proves that low-rank structure inherently limits update magnitudes and that spectral initializations improve performance primarily by amplifying updates, not by embedding knowledge. To obtain spectral gains without losing efficiency, the authors propose LoRAM, a magnitude-driven initialization that uses deterministic orthogonal bases (DST) scaled by pretrained weight statistics, eliminating SVD overhead. Empirical results across NLP and vision-language benchmarks show that LoRAM matches or surpasses spectral methods while keeping LoRA’s parameter, memory, and compute efficiency. The result is a unifying perspective that connects learning rate, scaling factor, and initialization through the lens of update magnitude, with practical implications for robust, scalable PEFT deployment.

Abstract

Low-Rank Adaptation (LoRA) offers a parameter-efficient paradigm for tuning large models. While recent spectral initialization methods improve convergence and performance over the naive "Noise & Zeros" scheme, their extra computational and storage overhead undermines efficiency. In this paper, we establish update magnitude as the fundamental driver of LoRA performance and propose LoRAM, a magnitude-driven "Basis & Basis" initialization scheme that matches spectral methods without their inefficiencies. Our key contributions are threefold: (i) Magnitude of weight updates determines convergence. We prove low-rank structures intrinsically bound update magnitudes, unifying hyperparameter tuning in learning rate, scaling factor, and initialization as mechanisms to optimize magnitude regulation. (ii) Spectral initialization succeeds via magnitude amplification. We demystify that the presumed knowledge-driven benefit of the spectral component essentially arises from the boost in the weight update magnitude. (iii) A novel and compact initialization strategy, LoRAM, scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. Extensive experiments show that LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks.
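The "Basis & Basis" idea in contribution (iii) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that DST in the TL;DR denotes the discrete sine transform and uses the orthonormal DST-I matrix as the deterministic basis. The scaling rule (square root of the standard deviation of the pretrained weight) and the names `dst_basis` and `basis_and_basis_init` are hypothetical placeholders, since the exact statistic used by LoRAM is not given in this summary.

```python
import numpy as np

def dst_basis(n: int, r: int) -> np.ndarray:
    """Return the first r columns of the orthonormal n x n DST-I matrix."""
    j = np.arange(1, n + 1)[:, None]  # row index 1..n
    k = np.arange(1, r + 1)[None, :]  # column index 1..r
    return np.sqrt(2.0 / (n + 1)) * np.sin(np.pi * j * k / (n + 1))

def basis_and_basis_init(W: np.ndarray, r: int):
    """Initialize both LoRA factors from a deterministic orthogonal basis.

    ASSUMPTION: the scale below is a stand-in magnitude statistic,
    not the formula from the paper.
    """
    m, n = W.shape
    scale = np.sqrt(W.std())        # hypothetical magnitude statistic
    B = scale * dst_basis(m, r)     # shape (m, r), orthogonal columns
    A = scale * dst_basis(n, r).T   # shape (r, n), orthogonal rows
    return A, B
```

Because the basis is computed in closed form, no SVD of `W` is needed and nothing beyond a scalar statistic has to be derived from the pretrained weight, which is the efficiency argument the abstract makes against spectral initialization.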

Paper Structure

This paper contains 38 sections, 7 theorems, 72 equations, 7 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

For LoRA layers defined in Eq. (eq:lora), consider decomposing the scaling factor as $\alpha = \alpha' \alpha_A \alpha_B$, where $\alpha', \alpha_A, \alpha_B \in \mathbb{R}^+$. Under the commonly used optimization frameworks with negligible numerical errors, the following parametrization schemes exhibit

Figures (7)

  • Figure 1: We propose LoRAM, a magnitude-driven initialization method that enhances both the convergence and performance of LoRA while maintaining its efficiency. Unlike spectral initialization, which precomputes and stores singular components ($U, V, S$) [PiSSA], LoRAM uses deterministic orthogonal bases and derives scaling from pretrained weight statistics. This elegant simplification is grounded in our analysis of LoRA through a novel lens of magnitude dynamics, where we show that the benefits of spectral values in scaling weight update magnitude can be effectively approximated.
  • Figure 2: (a) Validation of Proposition 1 (Parameter Scaling Equivalence). Each curve represents a model with unique hyperparameters. The norm difference (right axis) aggregates Frobenius norm discrepancies between the baseline model (black) and the others across layers. The purple curve and the other curves share identical learning rates but diverge due to differing initialization magnitudes. Equivalent optimization trajectories emerge from diverse hyperparameter combinations under both SGD and Adam optimizers. (b) Validation of Proposition 2 (Parameter Magnitude Dynamics). The black curve represents random orthogonal initialization. Parameter magnitudes are predominantly governed by initialization scaling, resulting in smaller weight changes than in conventional linear layers. This motivates magnitude scaling as a way to enhance LoRA performance.
  • Figure 3: Illustration of the spectral gain factor $Q[r]$ defined in Eq. (eq:spectral gain factor) and the spectral concentration factor $\rho[r]$ defined in Eq. (eq:spectral concentration factor) across DeBERTa-v3-base [he2021debertav3], LLaMA-2-7B [touvron2023llama], and FLUX.1-12B [flux2024]. Values are computed from uniformly sampled layers. The white dotted line represents the linear growth rate of naive LoRA weight magnitudes, while spectral initialization exhibits faster growth. Because the spectral gain factor is concave, we approximate it using a logarithmic function.
  • Figure 4: Comparison of LoRA, PiSSA, and LoRAM on image customization task. Experiments conducted with the state-of-the-art FLUX.1-12B model using rank 8.
  • Figure 5: Illustration of training loss curves. LoRAM achieves convergence dynamics comparable to PiSSA across diverse models and benchmarks. See the tables and text for the evaluation results.
  • ...and 2 more figures

Theorems & Definitions (14)

  • Proposition 1: Parameter Scaling Equivalence
  • Proposition 2: Parameter Magnitude Dynamics
  • Proposition 1: Lower Bound on Representation Error
  • proof
  • Proposition 2: Parameter Scaling Equivalence
  • proof
  • Proposition 3: Parameter Magnitude Dynamics
  • proof
  • Proposition 4: Linearized Dynamics Approximation
  • proof
  • ...and 4 more
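
The scaling decomposition $\alpha = \alpha' \alpha_A \alpha_B$ from Proposition 1 can be checked numerically at the forward-pass level. The sketch below assumes the usual LoRA branch $\alpha B A x$ (the page does not reproduce Eq. eq:lora) and uses hypothetical helper names `lora_delta` and `fold_scaling`. It only verifies that folding $\alpha_A$ into $A$ and $\alpha_B$ into $B$ leaves the layer output unchanged. It does not reproduce the trajectory-equivalence result shown in Figure 2(a), which also depends on the optimizer and learning rates.

```python
import numpy as np

def lora_delta(x, A, B, alpha):
    """Low-rank branch output alpha * B @ A @ x (frozen W @ x term omitted)."""
    return alpha * (B @ (A @ x))

def fold_scaling(A, B, alpha_prime, alpha_A, alpha_B):
    """Move alpha_A into A and alpha_B into B; alpha_prime stays external."""
    return alpha_A * A, alpha_B * B, alpha_prime

# Illustrative shapes and factors (all values are arbitrary assumptions).
rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
A = rng.normal(size=(r, n))
B = rng.normal(size=(m, r))
x = rng.normal(size=n)
alpha_prime, alpha_A, alpha_B = 0.5, 4.0, 2.0
alpha = alpha_prime * alpha_A * alpha_B

reference = lora_delta(x, A, B, alpha)
A_f, B_f, alpha_f = fold_scaling(A, B, alpha_prime, alpha_A, alpha_B)
folded = lora_delta(x, A_f, B_f, alpha_f)
```

The two outputs agree up to floating-point error, which is why the scaling factor and the initialization magnitude can be treated as interchangeable knobs at initialization.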