PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Fanxu Meng, Zhaohui Wang, Muhan Zhang

TL;DR

PiSSA introduces an SVD-based PEFT approach that tunes only the principal components of a pretrained weight matrix while freezing the residual, yielding faster convergence and improved performance over LoRA across a wide range of LLMs and tasks. By initializing adapters from the top singular values/vectors and keeping the residual frozen, PiSSA preserves base-model capacity and reduces gradient noise in early steps. The method is compatible with 4-bit NF4 quantization through QPiSSA, which reduces quantization error and maintains strong fine-tuning results. Extensive experiments across NLG and NLU benchmarks, model scales, and ranks demonstrate robust advantages over LoRA and QLoRA, with practical benefits like rapid initialization and easy deployment.

Abstract

To parameter-efficiently fine-tune (PEFT) large language models (LLMs), the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA freezes the original model $W$ and updates the "Noise & Zero" adapter, which may lead to slow convergence. To overcome this limitation, we introduce Principal Singular values and Singular vectors Adaptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adapter matrices $A$ and $B$ with the principal components of the original matrix $W$, and puts the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$, which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 12 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups. On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16 percentage points. Because it shares LoRA's architecture, PiSSA is also compatible with quantization to further reduce the memory requirement of fine-tuning. Compared to QLoRA, QPiSSA exhibits smaller quantization errors in the initial stages. When fine-tuning LLaMA-3-70B on GSM8K, QPiSSA attains an accuracy of 86.05%, exceeding QLoRA's 81.73%. Leveraging a fast SVD technique, PiSSA can be initialized in only a few seconds, presenting a negligible cost for transitioning from LoRA to PiSSA. Code is available at https://github.com/GraphPKU/PiSSA.
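
The initialization described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `pissa_init` is hypothetical, a full SVD is used in place of the fast SVD the abstract mentions, and splitting the top singular values between $A$ and $B$ via square roots is an assumption (the abstract only says the adapter is built from the principal components).

```python
import numpy as np

def pissa_init(W, r):
    """Split W into a trainable rank-r adapter (A, B) and a frozen residual.

    Hypothetical helper for illustration; the square-root split of the
    singular values between A and B is an assumption.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S                    # shape (m, r): top-r left vectors, scaled
    B = sqrt_S[:, None] * Vt[:r, :]          # shape (r, n): top-r right vectors, scaled
    W_res = (U[:, r:] * S[r:]) @ Vt[r:, :]   # shape (m, n): remaining components, frozen
    return A, B, W_res

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))              # toy stand-in for a pretrained weight
A, B, W_res = pissa_init(W, r=2)

# At initialization the adapted layer reproduces the original weight,
# just as LoRA does with its zero-initialized B.
reconstruction = W_res + A @ B
```

Under these assumptions `W_res + A @ B` equals `W` up to floating-point error, so swapping a LoRA adapter for this initialization leaves the model's outputs unchanged before training starts; only the trainable directions differ.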

Paper Structure (31 sections, 14 equations, 18 figures, 11 tables, 1 algorithm)

Figures (18)

  • Figure 1: The comparison among Full Fine-tuning, training with LoRA, and PiSSA. In this visualization, blue modules represent parts of the model whose parameters are frozen during training, while orange modules indicate components that require updates. QLoRA quantizes the pretrained matrix in LoRA to 4-bit, whereas QPiSSA quantizes the residual matrix in PiSSA.
  • Figure 2: We illustrate the two key advantages of PiSSA: converging faster and better, and reducing quantization error. In the left figure, we use a toy example to show PiSSA's faster convergence, where we first train a two-layer MLP to classify the odd digits of MNIST, and then fine-tune the model on the even digits. PiSSA finds the right direction more quickly and achieves a lower loss with the same number of steps. In the right figure, PiSSA reduces quantization error more effectively than LoftQ \cite{li2023loftq}, with an optional 5-iteration SVD for further error reduction, as detailed in Appendix \ref{appendix_sec:quant_error_of_loftq_and_pissa_table}.
  • Figure 3: Visualizations of LLaMA 2-7B's "layers[0].self_attn.q_proj" matrix, with distributions for the full model shown in Appendix \ref{appendix_sec:narrower_distribution}. Figures (a), (b), (d), and (e) display the singular values of $W$, $W^{res}$, $W - nf4(W)$, and $W^{res} - nf4(W^{res})$, respectively. Figures (c) and (f) show the data distributions of $W$ and $W^{res}$.
  • Figure 4: The loss, grad norm, and evaluation accuracy over the training steps of LoRA (indicated in blue), PiSSA (in orange), and full parameter fine-tuning (in red).
  • Figure 5: The loss, grad norm, and evaluation accuracy over the training steps of (Q)LoRA, (Q)PiSSA, LoftQ and full parameter fine-tuning.
  • ...and 13 more figures
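
The captions of Figures 1 and 3 contrast quantizing the full matrix $W$ (QLoRA) with quantizing only the residual $W^{res}$ (QPiSSA). The sketch below illustrates why the second can give a smaller initial error. Everything in it is an assumption for illustration: the matrix is synthetic (a strong rank-4 component plus small noise), `quantize_absmax` is a toy per-tensor uniform 4-bit quantizer standing in for NF4, and the error is measured with the Frobenius norm.

```python
import numpy as np

def quantize_absmax(x, bits=4):
    """Toy per-tensor uniform quantizer; a stand-in for NF4, not NF4 itself."""
    scale = np.max(np.abs(x))
    if scale == 0:
        return x.copy()
    levels = 2**bits - 1
    codes = np.round((x / scale + 1) / 2 * levels)   # integer codes in [0, levels]
    return (codes / levels * 2 - 1) * scale          # map codes back to [-scale, scale]

def split_principal(W, r):
    """Return the rank-r principal part of W and the residual W - principal."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    principal = (U[:, :r] * S[:r]) @ Vt[:r, :]
    return principal, W - principal

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4
# Synthetic weight: dominant low-rank structure plus small dense noise.
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) \
    + 0.1 * rng.standard_normal((m, n))

principal, W_res = split_principal(W, r)

# QLoRA-style start: the adapter contributes zero, so the model sees quantize(W).
err_qlora = np.linalg.norm(W - quantize_absmax(W))
# QPiSSA-style start: the full-precision adapter holds the principal part,
# and only the residual is quantized.
err_qpissa = np.linalg.norm(W - (quantize_absmax(W_res) + principal))
```

In this toy setting the residual has a much narrower value range than `W`, so the same number of quantization levels covers it more finely and `err_qpissa` comes out smaller than `err_qlora`. Real weight matrices and NF4 behave differently in detail; the paper's own measurements are the ones shown in its figures.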