SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models

Yang Cao; Zhao Song

TL;DR

SORSA addresses the inefficiency and generalization challenges of full-parameter fine-tuning with an SVD-based, parameter-efficient adapter that keeps a frozen residual. It decomposes a pre-trained weight $W_0$ into a trainable principal part $W_p = U_p \mathrm{diag}(S_p) V_p^\top$ and a frozen residual $W_r$, and applies an orthonormal regularizer to $U_p$ and $V_p$. This yields faster convergence and better-conditioned updates, and the adapter can be merged into the base weights so that it adds no inference latency. The authors provide a convergence analysis for gradient descent on the combined loss, together with theoretical results showing that the regularizer improves the condition number of the trainable weight $W_p$, supported by stepwise perturbation bounds. Empirically, SORSA matches or outperforms LoRA, PiSSA, AdaLoRA, and full fine-tuning on NLP benchmarks such as GSM-8K, while preserving the structure of the pre-trained knowledge and reducing overfitting in low-data regimes. Overall, SORSA offers a principled, efficient alternative for adapting large language models, with practical value for scalable fine-tuning and deployment.

Abstract

In this paper, we propose Singular Values and Orthonormal Regularized Singular Vectors Adaptation, or SORSA, a novel parameter-efficient fine-tuning (PEFT) method. Each SORSA adapter consists of two main parts: trainable principal singular weights $W_p = U_p \text{diag}(S_p) V^\top_p$, and frozen residual weights $W_r = U_r \text{diag}(S_r) V^\top_r$. These parts are initialized by performing singular value decomposition (SVD) on the pre-trained weights. Moreover, we implement and analyze an orthonormal regularizer, which we prove can decrease the condition number of $W_p$ and make the optimization more efficient. SORSA adapters can be merged into the base weights before inference, eliminating any added inference latency. We also introduce an SVD-based method for analyzing how the parameters change during training, and use it to discuss how SORSA better limits changes to the singular values and singular vectors of the pre-trained weights. In our experiments, SORSA converges faster than LoRA and PiSSA. On the GSM-8K benchmark, Llama 2 7B adapted using SORSA achieved 56.03% accuracy, surpassing LoRA (42.30%) and Full FT (49.05%). We conclude that SORSA offers a new perspective on parameter-efficient fine-tuning, demonstrating remarkable performance.
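
The decomposition and merge described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sorsa_init`, `orthonormal_reg`, `merge`), the toy matrix size, and the rank `r = 2` are all hypothetical. The exact form of the orthonormal regularizer (Definition 4.1) is not reproduced in this summary, so the Frobenius-norm penalty below is an assumed form that simply measures how far $U_p$ and $V_p$ are from having orthonormal columns.

```python
import numpy as np

def sorsa_init(W0, r):
    """Split W0 into rank-r principal factors and a residual via SVD."""
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    # Principal part: top-r singular triplets (trainable in SORSA).
    U_p, S_p, Vt_p = U[:, :r], S[:r], Vt[:r, :]
    # Residual part: remaining singular triplets (kept frozen).
    W_r = (U[:, r:] * S[r:]) @ Vt[r:, :]
    return U_p, S_p, Vt_p, W_r

def orthonormal_reg(U_p, Vt_p):
    """Assumed penalty: distance of U_p and V_p from orthonormality."""
    eye = np.eye(U_p.shape[1])
    return (np.linalg.norm(U_p.T @ U_p - eye, "fro")
            + np.linalg.norm(Vt_p @ Vt_p.T - eye, "fro"))

def merge(U_p, S_p, Vt_p, W_r):
    """Fold the adapter back into one dense weight: W_p + W_r."""
    return (U_p * S_p) @ Vt_p + W_r

# Toy stand-in for a pre-trained weight matrix (hypothetical size).
rng = np.random.default_rng(0)
W0 = rng.standard_normal((8, 6))

U_p, S_p, Vt_p, W_r = sorsa_init(W0, r=2)
W_merged = merge(U_p, S_p, Vt_p, W_r)
reg_at_init = orthonormal_reg(U_p, Vt_p)
# While U_p and V_p stay orthonormal, S_p holds the singular values of W_p,
# so the condition number of W_p (restricted to rank r) is their ratio.
cond_p = S_p[0] / S_p[-1]
```

At initialization the merged weight reproduces $W_0$ up to floating-point error and the penalty is numerically zero, because SVD returns orthonormal singular vectors. During training only `U_p`, `S_p`, and `Vt_p` would be updated, and the penalty would discourage the singular vectors from drifting away from orthonormality.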

Paper Structure (32 sections, 4 theorems, 39 equations, 2 figures, 6 tables)

Introduction
Related Work
Efficient Computation in Machine Learning
PEFT Methods
Condition Numbers in Neural Networks
Preliminary
Notations
PEFT Methods
LoRA
AdaLoRA
DoRA
OLoRA
PiSSA
Condition Number
Our Method
...and 17 more sections

Key Result

Lemma 5.1

Suppose $\|U_p\|_F \leq M_U$ and $\|V_p\|_F \leq M_V$. Then $\mathcal{L}_\mathrm{reg}$ is Lipschitz continuous in the Frobenius norm: where

Figures (2)

Figure 1: $\Delta D$ and $\Delta \Sigma$ of each trainable parameter over the training steps. Numbers in the plot indicate the layer of the weight. Dots represent the mean $\Delta D$ and $\Delta \Sigma$ at a specific step. Colors from dark to light represent the time step from $0$ to $T$, where $T=781$ in these graphs.
Figure 2: Training loss and gradient norm comparison between SORSA, PiSSA, and LoRA on MetaMathQA training of RWKV6 7B and Llama 2 7B. The LoRA and PiSSA curves for Llama 2 7B are from mwz24.

Theorems & Definitions (10)

Definition 3.1: Condition Number
Definition 4.1: Orthonormal regularizer
Lemma 5.1: Lipschitz continuity of $\mathcal{L}_\mathrm{reg}$
proof
Theorem 5.4: Linear convergence of SORSA
proof
Lemma 5.5
proof
Theorem 5.6
proof
