SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models

Yang Cao, Zhao Song

TL;DR

SORSA addresses the inefficiency and generalization challenges of full-parameter fine-tuning by proposing an SVD-based, parameter-efficient adapter with a frozen residual. By decomposing a pre-trained weight $W_0$ into a trainable principal part $W_p = U_p \mathrm{diag}(S_p) V_p^T$ and a frozen residual $W_r$, and by applying an orthonormal regularizer to $U_p$ and $V_p$, SORSA achieves faster convergence and better-conditioned updates. Its adapters can be merged into the base weights, so inference incurs no added latency. The authors provide a convergence analysis for gradient descent on the combined loss and theoretical results showing that the regularizer improves the condition number of the weight matrix, supported by stepwise perturbation bounds. Empirically, SORSA demonstrates superior or competitive performance across NLP benchmarks (e.g., GSM-8K) compared with LoRA, PiSSA, AdaLoRA, and full fine-tuning, while preserving the pre-trained knowledge structure and reducing overfitting in low-data regimes. Overall, SORSA offers a principled, efficient alternative for adapting large language models, suited to scalable fine-tuning and deployment.

Abstract

In this paper, we propose Singular Values and Orthonormal Regularized Singular Vectors Adaptation, or SORSA, a novel parameter-efficient fine-tuning (PEFT) method. Each SORSA adapter consists of two main parts: trainable principal singular weights $W_p = U_p \text{diag}(S_p) V^\top_p$, and frozen residual weights $W_r = U_r \text{diag}(S_r) V^\top_r$. These parts are initialized by performing singular value decomposition (SVD) on pre-trained weights. Moreover, we implement and analyze an orthonormal regularizer, which we prove can decrease the condition number of $W_p$ and make the optimization more efficient. SORSA adapters can be merged during inference, thus eliminating any inference latency. We also introduce a method to analyze the variation of the parameters by performing SVD, and we discuss and analyze SORSA's superiority in minimizing the alteration of the weights' SVD structure. In our experiments, SORSA converges faster than LoRA and PiSSA. On the GSM-8K benchmark, Llama 2 7B adapted using SORSA achieved 56.03\% accuracy, surpassing LoRA (42.30\%) and Full FT (49.05\%). We conclude that SORSA offers a new perspective on parameter-efficient fine-tuning, demonstrating remarkable performance.
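
The decomposition described in the abstract can be sketched in NumPy. This is a minimal illustration and not the authors' implementation: the function names `sorsa_init` and `merge`, the matrix shape, and the rank `r=8` are assumptions chosen for the example, and the sketch takes the principal part to be the top-`r` singular triplets of the pre-trained weight.

```python
import numpy as np

def sorsa_init(W0, r):
    """Split W0 into principal factors (U_p, S_p, V_p) and a frozen residual W_r.

    Hypothetical helper: assumes the principal part is the top-r singular triplets.
    """
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    # V_p is stored as n x r so that W_p = U_p diag(S_p) V_p^T.
    U_p, S_p, V_p = U[:, :r], S[:r], Vt[:r, :].T
    # Residual built from the remaining singular triplets; it stays frozen.
    W_r = (U[:, r:] * S[r:]) @ Vt[r:, :]
    return U_p, S_p, V_p, W_r

def merge(U_p, S_p, V_p, W_r):
    """Fold the adapter back into one dense matrix for inference."""
    return (U_p * S_p) @ V_p.T + W_r

rng = np.random.default_rng(0)
W0 = rng.standard_normal((64, 48))       # stand-in for a pre-trained weight
U_p, S_p, V_p, W_r = sorsa_init(W0, r=8)
W_merged = merge(U_p, S_p, V_p, W_r)     # reproduces W0 before any training
```

At initialization the merged matrix equals the pre-trained weight up to round-off. After training, the same merge yields a single dense matrix of the original shape, which is why a merged adapter adds no inference latency.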

Paper Structure

This paper contains 32 sections, 4 theorems, 39 equations, 2 figures, and 6 tables.

Key Result

Lemma 5.1

Suppose $\|U_p\|_F \leq M_U$ and $\|V_p\|_F \leq M_V$. Then $\mathcal{L}_\mathrm{reg}$ is Lipschitz continuous in the Frobenius norm: where
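
The lemma's inequality and constant are not reproduced in the text above, so the following is only a numerical sketch under stated assumptions. It assumes the regularizer has the form $\mathcal{L}_\mathrm{reg} = \|U_p^\top U_p - I\|_F^2 + \|V_p^\top V_p - I\|_F^2$ with $U_p$ and $V_p$ stored as tall matrices with $r$ columns. Under that assumption the gradient with respect to $U_p$ is $4 U_p (U_p^\top U_p - I)$, whose Frobenius norm is at most $4 M_U (M_U^2 + \sqrt{r})$ on the ball $\|U_p\|_F \leq M_U$, and likewise for $V_p$. The resulting constant is one valid Lipschitz bound, not necessarily the one stated in the paper. The names `orth_reg`, `lipschitz_bound`, and `random_in_ball` are hypothetical.

```python
import numpy as np

def orth_reg(U, V):
    """Assumed regularizer: ||U^T U - I||_F^2 + ||V^T V - I||_F^2."""
    I = np.eye(U.shape[1])
    return (np.linalg.norm(U.T @ U - I, "fro") ** 2
            + np.linalg.norm(V.T @ V - I, "fro") ** 2)

def lipschitz_bound(M_U, M_V, r):
    """Bound on the gradient norm over the balls ||U||_F <= M_U, ||V||_F <= M_V."""
    g_U = 4 * M_U * (M_U ** 2 + np.sqrt(r))
    g_V = 4 * M_V * (M_V ** 2 + np.sqrt(r))
    # Combine the two blocks for the joint norm sqrt(||dU||_F^2 + ||dV||_F^2).
    return np.hypot(g_U, g_V)

def random_in_ball(rng, shape, M):
    """Random matrix whose Frobenius norm is at most M."""
    X = rng.standard_normal(shape)
    return X * (M * rng.uniform() / np.linalg.norm(X))

rng = np.random.default_rng(0)
m, n, r, M_U, M_V = 32, 24, 4, 3.0, 3.0
L = lipschitz_bound(M_U, M_V, r)

worst_ratio = 0.0
for _ in range(200):
    U1, U2 = random_in_ball(rng, (m, r), M_U), random_in_ball(rng, (m, r), M_U)
    V1, V2 = random_in_ball(rng, (n, r), M_V), random_in_ball(rng, (n, r), M_V)
    dist = np.sqrt(np.linalg.norm(U1 - U2) ** 2 + np.linalg.norm(V1 - V2) ** 2)
    ratio = abs(orth_reg(U1, V1) - orth_reg(U2, V2)) / dist
    worst_ratio = max(worst_ratio, ratio)
```

Every sampled difference quotient stays below `L`, as the mean value theorem on the convex ball guarantees. Random sampling only spot-checks the bound and does not prove it.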

Figures (2)

  • Figure 1: $\Delta D$ and $\Delta \Sigma$ of each trainable parameter during training. Numbers in the plot indicate the layer of the weight. Dots represent the mean $\Delta D$ and $\Delta \Sigma$ at a specific step. Colors from dark to light represent the time step from $0$ to $T$, where $T=781$ in these graphs.
  • Figure 2: The training loss and gradient norm comparison between $\mathrm{SORSA}$, PiSSA, and LoRA on MetaMathQA training of RWKV6 7B and Llama 2 7B. LoRA and PiSSA curves of Llama 2 7B are from mwz24.

Theorems & Definitions (10)

  • Definition 3.1: Condition Number
  • Definition 4.1: Orthonormal regularizer
  • Lemma 5.1: Lipschitz continuity of $\mathcal{L}_\mathrm{reg}$
  • proof
  • Theorem 5.4: Linear convergence of $\mathrm{SORSA}$
  • proof
  • Lemma 5.5
  • proof
  • Theorem 5.6
  • proof
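
To make Definition 3.1 concrete, the sketch below assumes the standard definition of the condition number as the ratio of the largest to the smallest singular value. Because $W_p$ has rank $r$, the ratio is taken over its $r$ largest singular values. The helper `cond_rank_r` and the values of `S_p` are assumptions for illustration. The example illustrates the intuition behind regularizing $U_p$ and $V_p$ and is not the paper's proof.

```python
import numpy as np

def cond_rank_r(W, r):
    """Ratio of the largest to the r-th largest singular value of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return s[0] / s[r - 1]

rng = np.random.default_rng(0)
m, n, r = 32, 24, 4
U_p, _ = np.linalg.qr(rng.standard_normal((m, r)))   # orthonormal columns
V_p, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns
S_p = np.array([4.0, 3.0, 2.0, 1.0])

# With orthonormal factors, the singular values of W_p are exactly S_p.
W_p = (U_p * S_p) @ V_p.T
kappa_orth = cond_rank_r(W_p, r)

# Let one column of U_p drift until it is nearly parallel to another.
U_skew = U_p.copy()
U_skew[:, 1] = U_p[:, 0] + 0.1 * U_p[:, 1]
W_skew = (U_skew * S_p) @ V_p.T
kappa_skew = cond_rank_r(W_skew, r)
```

With orthonormal factors the condition number is `max(S_p) / min(S_p)`, so it is controlled by the singular values alone. Once the columns of `U_skew` lose orthogonality, `kappa_skew` exceeds `kappa_orth` even though `S_p` is unchanged.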