
RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation

Mahdi Nikdan, Soroush Tabesh, Elvir Crnčević, Dan Alistarh

TL;DR

RoSA (Robust Adaptation) is a parameter-efficient fine-tuning method that combines a low-rank adapter with a sparse adapter to approximate the weight updates of full fine-tuning (FFT). Motivated by robust PCA, RoSA captures both the low-rank structure and the sparse outliers in FFT updates, enabling high accuracy at modest parameter budgets. The approach includes a task-adaptive mask generation strategy, an efficient GPU-oriented backward pass based on SDDMM, and a QRoSA variant that quantizes the base weights for additional memory savings. Experiments on LLaMA2-7B across GSM8k, ViGGO, and SQL show that RoSA often matches or surpasses FFT with 40–100x fewer trainable parameters, while QRoSA further reduces memory. These results suggest that RoSA can make fine-tuning large language models more practical in resource-constrained environments, bringing FFT-like performance within reach for many real-world tasks.
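The combination of a frozen weight matrix, a low-rank adapter, and a sparse adapter can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation. The helper names (`make_layer`, `rosa_forward`, `trainable_params`), the zero initialization of `B` and of the sparse values, and the random choice of the sparse support are all assumptions made for the example; in particular, the random support is only a placeholder for the task-adaptive mask generation mentioned above.

```python
import random


def matvec(matrix, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def sparse_matvec(sparse, x, out_dim):
    """Matrix-vector product for a sparse matrix stored as {(row, col): value}."""
    y = [0.0] * out_dim
    for (i, j), value in sparse.items():
        y[i] += value * x[j]
    return y


def make_layer(out_dim, in_dim, rank, density, seed=0):
    """Build frozen weights W plus hypothetical adapters B, A (low-rank) and S (sparse)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    A = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(out_dim)]  # assumption: adapters start as a zero update
    nnz = round(density * out_dim * in_dim)
    positions = [(i, j) for i in range(out_dim) for j in range(in_dim)]
    # Assumption: random support, standing in for a task-adaptive mask.
    S = {pos: 0.0 for pos in rng.sample(positions, nnz)}
    return W, B, A, S


def rosa_forward(W, B, A, S, x):
    """Compute (W + B A + S) x without materializing the dense sum."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    sparse = sparse_matvec(S, x, len(W))
    return [b + l + s for b, l, s in zip(base, low_rank, sparse)]


def trainable_params(out_dim, in_dim, rank, density):
    """Trainable values: rank * (out + in) for B and A, plus the nonzeros of S."""
    return rank * (out_dim + in_dim) + round(density * out_dim * in_dim)
```

For a toy $5 \times 4$ layer with rank $2$ and density $20\%$, `trainable_params(5, 4, 2, 0.2)` counts $2 \cdot (5 + 4) + 4 = 22$ values, which is more than the 20 entries of the weight matrix itself. The savings only appear at realistic layer sizes, where $r(m + n) + d \cdot mn$ is far smaller than $mn$.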

Abstract

We investigate parameter-efficient fine-tuning (PEFT) methods that can provide good accuracy under limited computational and memory budgets in the context of large language models (LLMs). We present a new PEFT method called Robust Adaptation (RoSA) inspired by robust principal component analysis that jointly trains $\textit{low-rank}$ and $\textit{highly-sparse}$ components on top of a set of fixed pretrained weights to efficiently approximate the performance of a full-fine-tuning (FFT) solution. Across a series of challenging generative tasks such as grade-school math and SQL query generation, which require fine-tuning for good performance, we show that RoSA outperforms LoRA, pure sparse fine-tuning, and alternative hybrid methods at the same parameter budget, and can even recover the performance of FFT on some tasks. We provide system support for RoSA to complement the training algorithm, specifically in the form of sparse GPU kernels which enable memory- and computationally-efficient training, and show that it is also compatible with low-precision base weights, resulting in the first joint representation combining quantization, low-rank and sparse approximations. Our code is available at https://github.com/IST-DASLab/RoSA.
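The sparse GPU kernels mentioned above are not reproduced here, but the underlying idea of a sampled dense-dense matrix multiplication (SDDMM) can be illustrated with a small reference sketch. For a linear layer, the gradient with respect to the weights is the product of the output gradients and the inputs summed over the batch; a sparse adapter only needs the entries of that product on its support. The function names and the dictionary-based sparse format below are assumptions for illustration, and the code is a plain-Python reference for the math rather than an efficient kernel.

```python
def sddmm(grad_out, inputs, support):
    """Evaluate (grad_out^T @ inputs) only at the (row, col) positions in `support`.

    grad_out: list of per-sample output gradients, shape batch x out_dim
    inputs:   list of per-sample inputs,           shape batch x in_dim
    support:  iterable of (row, col) positions of the sparse adapter
    """
    return {
        (i, j): sum(g[i] * x[j] for g, x in zip(grad_out, inputs))
        for (i, j) in support
    }


def dense_weight_grad(grad_out, inputs):
    """Full out_dim x in_dim weight gradient, used here only as a reference."""
    out_dim, in_dim = len(grad_out[0]), len(inputs[0])
    return [
        [sum(g[i] * x[j] for g, x in zip(grad_out, inputs)) for j in range(in_dim)]
        for i in range(out_dim)
    ]
```

The sampled version does work proportional to the number of nonzeros times the batch size and never allocates the dense gradient, which is the property that makes this formulation attractive for highly sparse adapters.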

Paper Structure (48 sections, 10 equations, 7 figures, 7 tables, 2 algorithms)

Figures (7)

  • Figure 1: Illustration of Robust Adaptation (RoSA) applied to a single FC layer: In this instance, the weight matrix is of dimensions $5 \times 4$ and the batch size is $1$. The low-rank adapter has a rank of $2$, and the sparse adapter has a density of $20\%$. Trainable parameters are depicted in green, while red indicates parameters that remain frozen.
  • Figure 2: Comparison of the highest accuracy achieved by a single-epoch adaptation using various methods across three datasets on LLaMA2-7B, taken from our main experiments in Table \ref{table:main-results}. (While LoRA and RoSA store parameters in bfloat16, we use float32 for FFT since it is more stable.) Each bar shows the percentage of accuracy relative to the accuracy achieved by FFT, and the numbers on top of the bars indicate the absolute accuracy.
  • Figure 3: Illustration of the Frobenius norm error (Figure \ref{fig:rpca_contour}) of a Robust PCA approximation to the full-fine-tuning update for an arbitrary layer (l:20, v_proj) of LLaMA2-7B, while varying rank and sparsity independently. Figure \ref{fig:rpca_slice} depicts slices of Figure \ref{fig:rpca_contour} with similar parameter counts, showcasing the trade-off between sparsity and low rank under different parameter budgets.
  • Figure 4: Illustration of row and column sparsity structure for the RoSA masks. Specifically, a subset of masks in the LLaMA2-7B model is visualized with a max-pool kernel of size 4 and stride 4, showing that around 50% of the parameter rows and columns are completely zero.
  • Figure 5: Visualization of a subset of masks taken from the LLaMA2-7B model trained on GSM8k ($r=16, d=0.6\%$). Most of the visualized masks have a significant number of either empty rows or empty columns. For visualization, each mask is max-pooled with a kernel size and stride of 4.
  • ...and 2 more figures
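
The mask visualizations described in Figures 4 and 5 rely on two simple operations: max-pooling a binary mask with a kernel size and stride of 4, and counting rows and columns that contain no selected entries. A minimal sketch of both, assuming the mask is stored as a list of 0/1 rows (the function names are hypothetical and not taken from the paper's code):

```python
def max_pool_mask(mask, kernel=4):
    """Max-pool a 0/1 mask with equal kernel size and stride; edge blocks are clipped."""
    rows, cols = len(mask), len(mask[0])
    return [
        [
            max(
                mask[i][j]
                for i in range(r, min(r + kernel, rows))
                for j in range(c, min(c + kernel, cols))
            )
            for c in range(0, cols, kernel)
        ]
        for r in range(0, rows, kernel)
    ]


def empty_fractions(mask):
    """Return (fraction of all-zero rows, fraction of all-zero columns)."""
    empty_rows = sum(1 for row in mask if not any(row))
    empty_cols = sum(1 for col in zip(*mask) if not any(col))
    return empty_rows / len(mask), empty_cols / len(mask[0])
```

A pooled cell is 1 whenever any entry in its $4 \times 4$ block is selected, so the pooled image can look denser than the underlying mask; the empty-row and empty-column fractions are therefore computed on the original mask.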