ROSA: Random Subspace Adaptation for Efficient Fine-Tuning

Marawan Gamal Abdel Hameed; Aristides Milios; Siva Reddy; Guillaume Rabusseau

TL;DR

ROSA addresses the memory bottleneck of fine-tuning large pretrained language models with Random Subspace Adaptation: it trains only a randomly selected low-rank subspace obtained from an SVD of the weights, then periodically merges and re-samples, so the adapted subspace grows over the course of training while inference latency stays unchanged. The analysis shows that ROSA is not subject to LoRA’s low-rank bias and can reach the same optimum as full fine-tuning in a bounded number of steps, $T = \lceil \frac{\operatorname{rank}(XW_0 - Y)}{R} \rceil$, and experiments show it matching or surpassing full fine-tuning on GLUE and E2E NLG tasks. ROSA reduces the trainable parameter count by a factor of $\rho_{\text{train}} = \frac{MN}{R(M+N)}$ for an $M \times N$ weight matrix and rank parameter $R$. The periodic subspace re-factorization happens only during training, so the merged model incurs no inference overhead. The results demonstrate ROSA’s greater expressivity and its practical value for single-task downstream NLP applications; the authors acknowledge limitations for multi-task deployment and suggest extending the method to convolutional layers.
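As a rough illustration of the two formulas above, the following sketch evaluates them for one hypothetical layer. The sizes (`M = N = 768`, `R = 8`) and the residual rank of 768 are illustrative assumptions, not figures reported by the paper, and the helper names are invented for this example.

```python
import math


def trainable_param_reduction(m: int, n: int, r: int) -> float:
    """rho_train = MN / (R (M + N)): full parameter count over trainable count."""
    return (m * n) / (r * (m + n))


def steps_to_optimum(residual_rank: int, r: int) -> int:
    """T = ceil(rank(X W_0 - Y) / R), with the residual rank supplied directly."""
    return math.ceil(residual_rank / r)


# Hypothetical layer shape and rank parameter (assumptions for illustration).
M, N, R = 768, 768, 8

rho = trainable_param_reduction(M, N, R)  # 589824 / 12288
T = steps_to_optimum(768, R)              # assumes a full-rank residual

print(f"rho_train = {rho}, T = {T}")      # prints "rho_train = 48.0, T = 96"
```

Under these assumed sizes, a smaller `R` lowers memory (larger `rho_train`) but raises the bound `T`, which is the trade-off the rank parameter controls.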

Abstract

Model training requires significantly more memory than inference. Parameter-efficient fine-tuning (PEFT) methods provide a means of adapting large models to downstream tasks using less memory. However, existing methods such as adapters, prompt tuning or low-rank adaptation (LoRA) either introduce latency overhead at inference time or achieve subpar downstream performance compared with full fine-tuning. In this work we propose Random Subspace Adaptation (ROSA), a method that outperforms previous PEFT methods by a significant margin while maintaining zero latency overhead at inference time. In contrast to previous methods, ROSA is able to adapt subspaces of arbitrarily large dimension, better approximating full fine-tuning. We demonstrate both theoretically and experimentally that this makes ROSA strictly more expressive than LoRA, without consuming additional memory during runtime. As PEFT methods are especially useful in the natural language processing domain, where models operate on scales that make full fine-tuning very expensive, we evaluate ROSA in two common NLP scenarios: natural language generation (NLG) and natural language understanding (NLU), with GPT-2 and RoBERTa, respectively. We show that ROSA outperforms LoRA by a significant margin on almost every GLUE task, while also outperforming LoRA on NLG tasks. Our code is available at https://github.com/rosa-paper/rosa

Paper Structure (26 sections, 4 theorems, 23 equations, 12 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Adapters:
Prompt and prefix tuning:
LoRA:
(IA)3:
AdaLoRA & LASER:
Other methods:
Method
ROSA
Theoretical Analysis
Experiments
Synthetic Data
GLUE Experiments
NLG Experiments
...and 11 more sections

Key Result

Proposition 1

Let $\mathbf{W}_0$ be a weight matrix of a pre-trained model to be fine-tuned. Then, any fine-tuned weight matrix $\mathbf{W}_{\text{LoRA}}$ obtained using LoRA with rank parameter $R$ will be such that $\operatorname{rank}(\mathbf{W}_0 - \mathbf{W}_{\text{LoRA}}) \leq R$.
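The bound in Proposition 1 can be checked numerically. The sketch below assumes the usual LoRA parameterization $\mathbf{W}_{\text{LoRA}} = \mathbf{W}_0 + \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{M \times R}$ and $\mathbf{B} \in \mathbb{R}^{R \times N}$; the matrix sizes and the random values standing in for trained factors are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
M, N, R = 64, 48, 4

W0 = rng.standard_normal((M, N))   # stand-in for a pre-trained weight matrix
A = rng.standard_normal((M, R))    # stand-in for trained LoRA factors
B = rng.standard_normal((R, N))

W_lora = W0 + A @ B                # assumed LoRA parameterization

# The update W0 - W_lora equals -(A @ B), a product with inner dimension R,
# so its rank cannot exceed R regardless of how A and B were trained.
delta_rank = np.linalg.matrix_rank(W0 - W_lora)
print(delta_rank <= R)
```

Whatever values the factors take, the difference is a product through an $R$-dimensional bottleneck, which is why the bound holds for every LoRA solution and not only for this random draw.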

Figures (12)

Figure 1: Illustration of ROSA. Parameter matrix $\mathbf{W}$ is factorized using the singular value decomposition (SVD) and split into smaller trainable matrices $(\mathbf{A}, \mathbf{B})$ and a larger fixed matrix $(\mathbf{W}_{\text{fixed}})$. Gradients during back-propagation are only computed with respect to $(\mathbf{A}, \mathbf{B})$. The split is then merged after a specified number of training iterations, and the process is repeated. ROSA updates an increasingly larger subspace of $\mathbf{W}$ over the course of training while remaining memory efficient.
Figure 3: Memory usage during fine-tuning of $\text{RoBERTa}_\text{base}$ on the CoLA GLUE benchmark task, using ROSA compared with LoRA and full fine-tuning.
Figure 4: Analysing the trade-off between convergence rate and rank values for ROSA. In ROSA, lower rank values lead to a slower convergence rate. LoRA models, in contrast, are limited by their rank. ROSA, LoRA and the baseline models all run in the same amount of time (approximately 155 seconds per epoch).
...and 7 more figures
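The factorize, train, and merge cycle described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the singular directions are chosen uniformly at random, the singular values are folded into `A`, and a random perturbation of `(A, B)` stands in for gradient-based training. The names `rosa_split` and `rosa_merge` are hypothetical.

```python
import numpy as np


def rosa_split(W, R, rng):
    """Split W into trainable (A, B) and a fixed remainder via the SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Assumption: pick R singular directions uniformly at random.
    idx = rng.choice(S.shape[0], size=R, replace=False)
    A = U[:, idx] * S[idx]        # (M, R), singular values folded into A
    B = Vt[idx, :]                # (R, N)
    W_fixed = W - A @ B           # everything outside the chosen subspace
    return A, B, W_fixed


def rosa_merge(A, B, W_fixed):
    """Merge the trainable factors back into a single weight matrix."""
    return W_fixed + A @ B


rng = np.random.default_rng(0)
M, N, R = 32, 24, 2               # illustrative sizes (assumptions)
W0 = rng.standard_normal((M, N))
W = W0.copy()

for _ in range(4):                # four split/train/merge rounds
    A, B, W_fixed = rosa_split(W, R, rng)
    # Stand-in for training: only A and B change, W_fixed stays frozen.
    A = A + 0.1 * rng.standard_normal(A.shape)
    B = B + 0.1 * rng.standard_normal(B.shape)
    W = rosa_merge(A, B, W_fixed)

final_rank = np.linalg.matrix_rank(W - W0)
print(final_rank > R)             # the accumulated update is not limited to rank R
```

Only `R(M + N)` values are trainable at any one time, yet because a fresh subspace is drawn after each merge, the accumulated change to `W` is not confined to a single rank-`R` subspace, in contrast to the LoRA bound in Proposition 1.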

Theorems & Definitions (7)

Proposition 1
proof
Theorem 1
Theorem 2
proof
Theorem 1
proof
