VeRA: Vector-based Random Matrix Adaptation

Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

TL;DR

VeRA tackles the storage bottleneck of fine-tuning large models by reparameterizing layer updates using a single shared pair of frozen random matrices and trainable per-layer scaling vectors. This design yields a parameter footprint of $|\Theta| = L_{\text{tuned}} \times (d_{\text{model}} + r)$, dramatically smaller than LoRA, while preserving accuracy across GLUE, E2E, and image classification benchmarks and enabling instruction-tuning of 7B/13B models with orders of magnitude fewer trainable parameters. The approach demonstrates strong empirical results, matching or surpassing LoRA with far fewer trainable parameters and offering practical advantages for per-user or per-task deployment due to negligible inference-time changes and seed-based memory efficiency. Ablation studies confirm the necessity of both scaling vectors and show the robustness of VeRA to initialization choices and matrix-sharing strategies, highlighting its potential for scalable, memory-efficient fine-tuning in diverse domains.
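The parameter footprint quoted above can be checked with a few lines of arithmetic. The sketch below compares it with the usual LoRA count for a square weight matrix, $2 \times L_{\text{tuned}} \times d_{\text{model}} \times r$, which is an assumption brought in for comparison and is not stated in this summary. The function names and the configuration (24 adapted layers of width 768, LoRA rank 8, VeRA rank 1024) are hypothetical and chosen only for illustration.

```python
def lora_params(l_tuned: int, d_model: int, r: int) -> int:
    """Trainable parameters if each adapted layer trains A (r x d_model) and B (d_model x r).

    Assumes square d_model x d_model weight matrices.
    """
    return 2 * l_tuned * d_model * r


def vera_params(l_tuned: int, d_model: int, r: int) -> int:
    """Trainable parameters if each adapted layer trains only two vectors:
    one of length d_model and one of length r."""
    return l_tuned * (d_model + r)


# Hypothetical configuration, for illustration only.
l_tuned, d_model = 24, 768
lora = lora_params(l_tuned, d_model, r=8)
vera = vera_params(l_tuned, d_model, r=1024)
print(f"LoRA: {lora:,} trainable parameters")
print(f"VeRA: {vera:,} trainable parameters")
```

Because the VeRA count grows with the sum $d_{\text{model}} + r$ rather than the product, it can use a much larger rank than LoRA and still train fewer parameters.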

Abstract

Low-rank adaptation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models.
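As a rough illustration of the mechanism described above, the following NumPy sketch applies a frozen random pair $A$, $B$ scaled by two trainable vectors, here named $d$ and $b$ after the Figure 1 caption. Everything else is an assumption made for the sketch: the helper names, the uniform initialization of the random matrices, the starting values of the vectors, and the use of NumPy in place of a deep learning framework. It handles a single layer only and does no training.

```python
import numpy as np


def make_shared_matrices(seed, d_in, d_out, r):
    """Regenerate the frozen random pair (A, B) from a seed.

    The scaled uniform distribution is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(r, d_in)) / np.sqrt(d_in)
    B = rng.uniform(-1.0, 1.0, size=(d_out, r)) / np.sqrt(r)
    return A, B


def vera_forward(x, W0, A, B, d, b):
    """Compute W0 x + diag(b) B diag(d) A x without forming the full update."""
    return W0 @ x + b * (B @ (d * (A @ x)))


def merge_weights(W0, A, B, d, b):
    """Fold the scaled low-rank update into the original weight matrix."""
    return W0 + (b[:, None] * B) @ (d[:, None] * A)


d_in = d_out = 16
r = 4
A, B = make_shared_matrices(seed=0, d_in=d_in, d_out=d_out, r=r)

rng = np.random.default_rng(1)
W0 = rng.normal(size=(d_out, d_in))  # stand-in for a pretrained weight matrix
x = rng.normal(size=d_in)

# Assumed starting point: b = 0, so the adapted layer equals the original one.
d = np.full(r, 0.1)
b = np.zeros(d_out)
assert np.allclose(vera_forward(x, W0, A, B, d, b), W0 @ x)

# After "training" (random values here), merged weights give the same output.
b = rng.normal(size=d_out)
W_merged = merge_weights(W0, A, B, d, b)
assert np.allclose(W_merged @ x, vera_forward(x, W0, A, B, d, b))
```

Only `d`, `b`, and the seed would need to be stored per adapted model in this sketch, since `make_shared_matrices` reproduces the same `A` and `B` from the seed.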

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Schematic comparison of LoRA (left) and VeRA (right). LoRA updates the weight matrix $W$ by training the low-rank matrices $A$ and $B$, with intermediate rank $r$. In VeRA these matrices are frozen, shared across all layers, and adapted with trainable vectors $d$ and $b$, substantially reducing the number of trainable parameters. In both cases, the low-rank matrices and vectors can be merged into the original weight matrix $W$, introducing no additional latency.
  • Figure 2: Performance of LoRA and VeRA methods for varying ranks on the RTE task.
  • Figure 3: Magnitude of the adapted $d$ vector for query and value matrices across layers for RoBERTa-L on the RTE task.
  • Figure 4: Performance gains per 1K trainable parameters on the RTE task for the RoBERTa-large model relative to the baseline. Formula: $(\text{accuracy}_{\text{method}} / \text{accuracy}_{\text{baseline}}) / \text{parameters}_{\text{method}} \times 100$
  • Figure 5: Cosine similarity of LoRA, VeRA, and random weights across layers.
  • ...and 1 more figures
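
The formula in the Figure 4 caption can be written as a small helper. This is a literal transcription of the caption, not the authors' plotting code; the function name and the input values are hypothetical, and the caption does not say in which unit the parameter count is expressed.

```python
def gain_per_parameter(acc_method: float, acc_baseline: float, n_params: float) -> float:
    """(accuracy_method / accuracy_baseline) / parameters_method * 100, as in the caption."""
    if acc_baseline <= 0 or n_params <= 0:
        raise ValueError("baseline accuracy and parameter count must be positive")
    return (acc_method / acc_baseline) / n_params * 100


# Hypothetical inputs: two methods with equal accuracy but different sizes.
small = gain_per_parameter(acc_method=0.8, acc_baseline=0.8, n_params=50.0)
large = gain_per_parameter(acc_method=0.8, acc_baseline=0.8, n_params=500.0)
print(small > large)  # the smaller adapter scores higher on this metric
```

At equal accuracy the metric is inversely proportional to the parameter count, so it rewards methods that reach the same result with fewer trainable parameters.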