Asymmetry in Low-Rank Adapters of Foundation Models

Jiacheng Zhu; Kristjan Greenewald; Kimia Nadjahi; Haitz Sáez de Ocáriz Borde; Rickard Brüel Gabrielsson; Leshem Choshen; Marzyeh Ghassemi; Mikhail Yurochkin; Justin Solomon

TL;DR

The paper analyzes Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, in which the weight update is factored as $\Delta W = BA$ with $A\in\mathbb{R}^{r\times d_{\mathrm{in}}}$ and $B\in\mathbb{R}^{d_{\mathrm{out}}\times r}$. Through linear and nonlinear analyses and experiments, it demonstrates a fundamental asymmetry: tuning $B$ matters more for producing the desired outputs, and a random or fixed $A$ often suffices, enabling a 2x reduction in trainable parameters without sacrificing performance. Information-theoretic generalization bounds are tighter when only one factor is tuned than when both factors are tuned jointly, especially when the input dimension is large. Empirically, across RoBERTa, BART-Large, LLaMA-2, and ViTs, freezing $A$ (or using a random orthogonal $A$) while training only $B$ achieves results competitive with or better than standard LoRA, giving practical guidance for efficient, generalizable fine-tuning across modalities.
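One way to read the 2x figure is as a direct consequence of the shapes of $A$ and $B$. The sketch below counts trainable adapter parameters for a single weight matrix; the function name and the layer sizes are hypothetical and chosen only for illustration. The factor of two is exact only when $d_{\mathrm{in}} = d_{\mathrm{out}}$.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> dict:
    """Count trainable adapter parameters for one weight matrix.

    Shapes follow the text: A is (r x d_in) and B is (d_out x r).
    """
    a_params = r * d_in
    b_params = d_out * r
    return {
        "A_and_B": a_params + b_params,  # standard LoRA trains both factors
        "B_only": b_params,              # A frozen (random or fixed), B trained
        "A_only": a_params,              # B frozen, A trained
    }


# Hypothetical square layer; these sizes are assumptions, not from the paper.
counts = lora_trainable_params(d_in=768, d_out=768, r=8)
ratio = counts["A_and_B"] / counts["B_only"]
print(counts, ratio)
```

For a non-square layer the saving from freezing $A$ is $(d_{\mathrm{in}} + d_{\mathrm{out}}) / d_{\mathrm{out}}$, so it grows with the input dimension.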

Abstract

Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating a subset of parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Inspired by an effort to investigate the different roles of LoRA matrices during fine-tuning, this paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices. Specifically, when updating the parameter matrices of a neural network by adding a product $BA$, we observe that the $B$ and $A$ matrices have distinct functions: $A$ extracts features from the input, while $B$ uses these features to create the desired output. Based on this observation, we demonstrate that fine-tuning $B$ is inherently more effective than fine-tuning $A$, and that a random untrained $A$ should perform nearly as well as a fine-tuned one. Using an information-theoretic lens, we also bound the generalization of low-rank adapters, showing that the parameter savings of exclusively training $B$ improve the bound. We support our conclusions with experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs.
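The division of labor described above, with $A$ extracting features and $B$ mapping them to the output, can be illustrated with a minimal sketch. It is not the authors' implementation: plain Python lists keep it dependency-free, and the helper names, layer sizes, Gaussian initialization of $A$, squared-error loss, and learning rate are all assumptions made for illustration.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x_j for m, x_j in zip(row, v)) for row in M]


def adapted_forward(W0, A, B, x):
    """Compute (W0 + B A) x without forming the product B A."""
    features = matvec(A, x)      # A maps the input to r features
    base = matvec(W0, x)         # frozen pre-trained weights
    delta = matvec(B, features)  # B maps the r features to an output correction
    return [b + d for b, d in zip(base, delta)], features


def sgd_step_on_B(W0, A, B, x, y, lr):
    """One squared-error gradient step that returns a new B; A is not modified."""
    y_hat, features = adapted_forward(W0, A, B, x)
    err = [p - t for p, t in zip(y_hat, y)]
    # Gradient of 0.5 * ||y_hat - y||^2 w.r.t. B is the outer product err * features^T.
    return [[B[i][k] - lr * err[i] * features[k] for k in range(len(features))]
            for i in range(len(B))]


def loss(W0, A, B, x, y):
    y_hat, _ = adapted_forward(W0, A, B, x)
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))


rng = random.Random(0)
d_in, d_out, r = 6, 4, 2  # hypothetical sizes
W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]  # random, frozen
B = [[0.0] * r for _ in range(d_out)]                           # zero init, trainable
x = [rng.gauss(0, 1) for _ in range(d_in)]
y = [rng.gauss(0, 1) for _ in range(d_out)]

B_new = sgd_step_on_B(W0, A, B, x, y, lr=1e-3)
print(loss(W0, A, B, x, y), loss(W0, A, B_new, x, y))
```

Because $B$ starts at zero, the adapted model initially reproduces the pre-trained one, and every later change to the output passes through the $r$ features that the frozen $A$ provides.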

Paper Structure

This paper contains 27 sections, 5 theorems, 35 equations, 2 figures, and 10 tables.

Introduction
Related Work
Preliminaries & Background
Theoretical Analysis
$A$, $B$ asymmetry in prediction tasks
Multivariate linear least-squares
Nonlinear losses and multilayer models
Advantages of tuning only $B$ over $BA$ together
Number of parameters
Generalization bounds
Discussion of theoretical analysis
Experiments
Natural Language Understanding
Natural Language Generation
Massive Multitask Language Understanding
...and 12 more sections

Key Result

Lemma 4.1

Optimizing $\mathcal{L}(A,B)$ while fixing $A = Q$ with $Q Q^\top = I_r$ yields a solution for $B$ expressed in terms of $\Sigma = \mathrm{Cov}[X_{\mathrm{targ}}]$, together with a corresponding expected loss.
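The closed form behind a statement of this kind can be sketched as an ordinary least-squares calculation. The setup below is an assumption made for illustration and is not taken from the text above: the target obeys $Y_{\mathrm{targ}} = (W_0 + \Delta) X_{\mathrm{targ}} + n$ with zero-mean $X_{\mathrm{targ}}$; the noise $n$ is independent of $X_{\mathrm{targ}}$ with zero mean and covariance $\sigma^2 I_{d_{\mathrm{out}}}$; the loss is $\mathcal{L}(A,B) = \mathbb{E}\,\|Y_{\mathrm{targ}} - (W_0 + BA) X_{\mathrm{targ}}\|_2^2$; and $Q \Sigma Q^\top$ is invertible. The symbols $W_0$, $\Delta$, $n$, and $\sigma$ are hypothetical names introduced here.

$$
\begin{aligned}
\mathcal{L}(Q,B) &= \operatorname{Tr}\!\big[(\Delta - BQ)\,\Sigma\,(\Delta - BQ)^\top\big] + d_{\mathrm{out}}\,\sigma^2, \\
\nabla_B \mathcal{L}(Q,B) &= -2\,(\Delta - BQ)\,\Sigma\,Q^\top = 0
\;\Longrightarrow\; B^* = \Delta\,\Sigma\,Q^\top (Q \Sigma Q^\top)^{-1}, \\
\mathcal{L}(Q,B^*) &= d_{\mathrm{out}}\,\sigma^2 + \operatorname{Tr}\!\big[\Delta \Sigma \Delta^\top\big] - \operatorname{Tr}\!\big[\Delta \Sigma Q^\top (Q \Sigma Q^\top)^{-1} Q \Sigma \Delta^\top\big].
\end{aligned}
$$

Under these assumptions, $B^*$ is the least-squares regression of the residual target $\Delta X_{\mathrm{targ}} + n$ onto the $r$ projected features $Q X_{\mathrm{targ}}$, which is consistent with the lemma's title, "Freezing $A$ yields regression on projected features". The exact statement in the paper may differ in notation or in its assumptions.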

Figures (2)

Figure 1: Similarity of learned LoRA matrices $A$ & $B$ across layers of a RoBERTa model fine-tuned with different initialization and data settings. $B$s are similar when fine-tuning on the same task (a) and dissimilar when fine-tuning on different tasks (b and c). $A$s are similar when initialized identically (b), even though fine-tuning is done on different tasks, and dissimilar when initialized randomly regardless of the fine-tuning task (a and c). The experiment demonstrates the asymmetric roles of $A$ and $B$ in LoRA.
Figure 2: Similarity of learned LoRA matrices $A$ & $B$ across layers of a RoBERTa model fine-tuned with different initialization and data settings. We compare conventional LoRA initialization (panels (a), (b), and (c): $A$ is initialized as random uniform and $B$ as zero) with a reversed initialization (panels (d), (e), and (f): $A$ is initialized as zero and $B$ as random uniform).

Theorems & Definitions (6)

Lemma 4.1: Freezing $A$ yields regression on projected features
Lemma 4.2: Freezing $B$ yields regression on projected outputs
Theorem 4.3: $A$, $B$ output fit asymmetry
Definition 4.4: Fine-tuning algorithms
Lemma 4.5: Generalization bounds on adapting $A$ and/or $B$
Theorem C.1: specialized from xu2017
