Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models

Fangzhao Zhang; Mert Pilanci

TL;DR

This work introduces a lightweight $r\times r$ Riemannian preconditioner for LoRA-based fine-tuning of foundation models, deriving the updates from a novel metric on the low-rank quotient manifold. With the adapter written as $BA$, where $A$ has $r$ rows and $B$ has $r$ columns, the preconditioned updates are $A_{t+1}=A_t-\alpha(B_t^TB_t)^{-1}(\nabla_{A_t}\mathcal{L})$ and $B_{t+1}=B_t-\alpha(\nabla_{B_t}\mathcal{L})(A_tA_t^T)^{-1}$. These updates project gradients onto the row and column subspaces, which stabilizes feature learning in the infinite-width limit and removes the need to tune separate learning rates for $A$ and $B$. Empirically, scaled GD and scaled AdamW outperform their unscaled counterparts on GPT-2, Mistral 7B, Stable Diffusion, and Mix-of-Show diffusion models, with negligible runtime overhead and improved robustness to the learning rate. Theoretical results establish convergence of the scaled method for a reparameterized two-layer ReLU network, with a rate that does not depend on data conditioning under suitable initialization. Overall, the approach offers a practical, scalable enhancement to PEFT fine-tuning, supported by both theory and experiments.
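As a rough illustration of the preconditioned updates, the NumPy sketch below applies one scaled gradient step to a toy least-squares loss $\frac{1}{2}\|BA-W^\star\|_F^2$. It assumes the adapter is $BA$ with `B` of shape `(m, r)` and `A` of shape `(r, n)`, so each $r\times r$ preconditioner multiplies the gradient on its rank-$r$ side. The function names, the toy loss, and the damping term `delta` are assumptions made for this example, not the authors' code.

```python
import numpy as np

def toy_loss_and_grads(A, B, W_target):
    """Toy loss 0.5 * ||B A - W_target||_F^2 and its gradients w.r.t. A and B."""
    R = B @ A - W_target                      # residual, shape (m, n)
    return 0.5 * np.sum(R ** 2), B.T @ R, R @ A.T

def scaled_gd_step(A, B, grad_A, grad_B, lr, delta=1e-8):
    """One preconditioned step; delta * I is an assumed damping for invertibility."""
    eye = np.eye(A.shape[0])                  # r x r identity
    # (B^T B + delta I)^{-1} grad_A, via a linear solve instead of an explicit inverse
    step_A = np.linalg.solve(B.T @ B + delta * eye, grad_A)
    # grad_B (A A^T + delta I)^{-1}; the matrix is symmetric, so solve the transposed system
    step_B = np.linalg.solve(A @ A.T + delta * eye, grad_B.T).T
    return A - lr * step_A, B - lr * step_B

# Hypothetical toy problem: m x n target, rank-r adapter.
rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
W_target = rng.standard_normal((m, n))
A0 = rng.standard_normal((r, n))
B0 = rng.standard_normal((m, r))

loss0, gA, gB = toy_loss_and_grads(A0, B0, W_target)
A1, B1 = scaled_gd_step(A0, B0, gA, gB, lr=1e-3)
loss1, _, _ = toy_loss_and_grads(A1, B1, W_target)
```

Each step only needs two $r\times r$ linear solves, which is why the extra cost stays small when $r$ is much smaller than the layer width.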

Abstract

Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method, which freezes the pretrained model weights and updates an additive low-rank trainable matrix. In this work, we study the enhancement of LoRA training by introducing an $r \times r$ preconditioner in each gradient step, where $r$ is the LoRA rank. We theoretically verify that the proposed preconditioner stabilizes feature learning with LoRA in the infinite-width neural network setting. Empirically, implementing the new preconditioner requires only a small change to existing optimizer code and adds negligible storage and runtime overhead. Our experimental results with both large language models and text-to-image diffusion models show that the new preconditioner significantly improves the convergence and reliability of SGD and AdamW. Moreover, the training process becomes much more robust to hyperparameter choices such as the learning rate. The new preconditioner can be derived from a novel Riemannian metric on the low-rank matrix field. Code can be accessed at https://github.com/pilancilab/Riemannian_Preconditioned_LoRA.
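To give a sense of how small the optimizer change can be, here is a minimal NumPy sketch of an AdamW-style step in which each LoRA gradient is rescaled before it reaches the usual moment estimates. Where exactly the scaling enters AdamW is an assumption of this sketch, as are the helper names (`precondition`, `AdamWState`, `scaled_adamw_step`), the damping `delta`, and the convention that `B` has shape `(m, r)` and `A` has shape `(r, n)`; the released code may differ.

```python
import numpy as np

def precondition(grad, other, side, delta=1e-6):
    """Scale a LoRA gradient by the damped inverse Gram matrix of the other factor.

    side="left":  (other^T other + delta I)^{-1} @ grad   (gradient of A, other = B)
    side="right": grad @ (other other^T + delta I)^{-1}   (gradient of B, other = A)
    """
    if side == "left":
        gram = other.T @ other
        return np.linalg.solve(gram + delta * np.eye(gram.shape[0]), grad)
    gram = other @ other.T
    # The Gram matrix is symmetric, so a solve on the transposed gradient suffices.
    return np.linalg.solve(gram + delta * np.eye(gram.shape[0]), grad.T).T

class AdamWState:
    """First and second moment buffers plus a step counter."""
    def __init__(self, shape):
        self.m = np.zeros(shape)
        self.v = np.zeros(shape)
        self.t = 0

def adamw_update(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.0):
    """Textbook AdamW step with decoupled weight decay."""
    state.t += 1
    state.m = betas[0] * state.m + (1 - betas[0]) * grad
    state.v = betas[1] * state.v + (1 - betas[1]) * grad ** 2
    m_hat = state.m / (1 - betas[0] ** state.t)
    v_hat = state.v / (1 - betas[1] ** state.t)
    param = param * (1 - lr * weight_decay)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

def scaled_adamw_step(A, B, grad_A, grad_B, state_A, state_B, lr=1e-3):
    # The only difference from plain AdamW in this sketch: rescale the gradients first.
    gA = precondition(grad_A, B, side="left")
    gB = precondition(grad_B, A, side="right")
    return adamw_update(A, gA, state_A, lr), adamw_update(B, gB, state_B, lr)
```

A caller would keep one `AdamWState` per LoRA factor and call `scaled_adamw_step` once per iteration; the only extra storage beyond plain AdamW is the transient $r\times r$ Gram matrix.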

Paper Structure

This paper contains 38 sections, 6 theorems, 61 equations, 14 figures, and 7 tables.

Introduction
Notation
Theoretical Insights
Stable Feature Learning
A Riemannian Metric Formulation
Empirical Results
Algorithms and Simple Implementation
Runtime Comparison
LLM Fine-Tuning
GPT-2
Mistral 7B
Diffusion Model Fine-Tuning
Object Generation
Face Generation
Convergence Theory
...and 23 more sections

Key Result

Theorem 4.1

[Stable Feature Learning (Informal)] Assume the LoRA parameters $A$ and $B$ are trained with Adam scaled by our preconditioner. Further assume that $BAx$ has dimension $\Theta(n)$. Then the LoRA model achieves stable feature learning with $\eta=\Theta(1)$, whereas for unscaled Adam, $\eta_A=\dots$
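The contrast between a single learning rate and separate ones can be illustrated in finite dimensions, although this is not the theorem's infinite-width argument and it uses plain gradient steps rather than Adam. Rescaling the factors to $(cB, A/c)$ leaves the product $BA$ unchanged, yet it changes what an unscaled gradient step does to $BA$, while the preconditioned step is unaffected. The toy loss $\frac{1}{2}\|BA-W^\star\|_F^2$, the function names, and the shapes below are assumptions chosen for this sketch.

```python
import numpy as np

def grads(A, B, W_target):
    """Gradients of 0.5 * ||B A - W_target||_F^2 w.r.t. A and B."""
    R = B @ A - W_target
    return B.T @ R, R @ A.T

def product_after_step(A, B, W_target, lr, scaled):
    """Return B_new @ A_new after one (optionally preconditioned) gradient step."""
    gA, gB = grads(A, B, W_target)
    if scaled:
        gA = np.linalg.solve(B.T @ B, gA)        # (B^T B)^{-1} grad_A
        gB = np.linalg.solve(A @ A.T, gB.T).T    # grad_B (A A^T)^{-1}
    return (B - lr * gB) @ (A - lr * gA)

rng = np.random.default_rng(1)
m, n, r, c = 8, 6, 2, 10.0
W_target = rng.standard_normal((m, n))
A = rng.standard_normal((r, n))
B = rng.standard_normal((m, r))

def gap(scaled):
    """Distance between the updated products for (B, A) and the rescaled pair (cB, A/c)."""
    W1 = product_after_step(A, B, W_target, 1e-2, scaled)
    W2 = product_after_step(A / c, c * B, W_target, 1e-2, scaled)
    return np.linalg.norm(W1 - W2)

gap_plain = gap(scaled=False)    # depends on c: the two factors see mismatched step sizes
gap_scaled = gap(scaled=True)    # zero up to floating-point error
```

In the unscaled case the effective step on $A$ grows with $c^2$ and the step on $B$ shrinks with $1/c^2$, so no single learning rate suits both factors; the preconditioner cancels this dependence.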

Figures (14)

Figure 1: Generation results for the prompt "a blue $\langle V_{\text{vase}} \rangle$" after fine-tuning the Stable Diffusion V1.5 model on 6 red vase images. Our method (scaled AdamW) produces no black images, whereas AdamW generates only black images at large learning rates. Our method generates photos that better capture the prompt and is more robust to learning rate changes. See the Object Generation section for experimental details.
Figure 2: Runtime for LoRA fine-tuning of the GPT-2 medium model with different optimizers, with rank $r=4$. Our scaled methods introduce negligible runtime overhead and train as fast as unscaled methods. See the Runtime Comparison section for experimental details.
Figure 3: Generation results for the prompt "a pencil sketch of $\langle V_{\text{potter}} \rangle$" by the Mix-of-Show model with different optimizers and various learning rates. Our method (scaled AdamW) generates photos that better capture the prompt, i.e., a pencil sketch, and is more robust to learning rate choices. See the Face Generation section for experimental details.
Figure 4: Runtime for LoRA fine-tuning of the GPT-2 medium model with different optimizers, with rank $r=256$. Our scaled methods introduce marginal runtime overhead and train as fast as unscaled methods. See the GPT-2 section for experimental details.
Figure 5: Generation results for the prompt "a yellow $\langle V_{\text{chair}} \rangle$" after fine-tuning the Stable Diffusion V1.5 model on 5 blue chair images. We vary the text-encoder learning rate with the U-Net learning rate fixed to its default value of $10^{-4}$. Our method (scaled AdamW) produces no black images, whereas AdamW generates only black images at large learning rates. Our method generates photos that better capture the prompt and is more robust to learning rate changes. See the Stable Diffusion appendix for experimental details.
...and 9 more figures

Theorems & Definitions (15)

Theorem 4.1
Definition 7.1
Definition 7.3
Theorem 7.4
proof
Definition 1.1
Lemma 1.3
proof
Theorem 1.4
proof
...and 5 more
