
Beyond SGD, Without SVD: Proximal Subspace Iteration LoRA with Diagonal Fractional K-FAC

Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horváth, Martin Takáč

TL;DR

This work proposes LoRSum, a memory-efficient subroutine that closes the gap between LoRA fine-tuning and training with full steps followed by low-rank projection (SVDLoRA). It also recovers several recently proposed preconditioning methods for LoRA as special cases.

Abstract

Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. In this work, we address the gap between LoRA fine-tuning and training with full steps followed by low-rank projection (SVDLoRA). We propose LoRSum, a memory-efficient subroutine that closes this gap for gradient descent by casting LoRA optimization as a proximal sub-problem and solving it efficiently with alternating least squares updates, which we prove to be an implicit block power method. We recover several recently proposed preconditioning methods for LoRA as special cases and show that LoRSum can also be used to update a low-rank momentum. To handle full steps with preconditioned gradient descent, we propose a scaled variant of LoRSum that uses structured metrics such as K-FAC and Shampoo, and we show that storing only the diagonal of these metrics still performs well while remaining memory-efficient. Experiments on a synthetic task, CIFAR-100, and language-model fine-tuning on GLUE, SQuAD v2, and WikiText-103 show that our method can match or improve on LoRA baselines with modest compute overhead, while avoiding full-matrix SVD projections and retaining LoRA-style parameter efficiency.
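
The link the abstract draws between alternating least squares and a block power method can be illustrated on a small dense problem. The sketch below is not the authors' LoRSum routine: it forms the target matrix `M` explicitly (which a memory-efficient method would avoid), and the function name `als_rank_r`, the problem sizes, and the synthetic spectrum are all assumptions chosen for illustration. It only shows that a few alternating least-squares sweeps on $\min_{\mathbf{B},\mathbf{A}} \|\mathbf{B}\mathbf{A} - \mathbf{M}\|_\mathrm{F}$ approach the error of the truncated SVD without ever computing an SVD.

```python
import numpy as np

def als_rank_r(M, A, num_iters=1):
    """Alternating least squares for min_{B,A} ||B @ A - M||_F.

    Warm-started from the right factor A (shape r x n); num_iters >= 1.
    Illustrative only: M is dense here, unlike a memory-efficient method.
    """
    B = None
    for _ in range(num_iters):
        # Fix A, solve min_B ||B A - M||_F (equivalently A^T B^T ~= M^T).
        B = np.linalg.lstsq(A.T, M.T, rcond=None)[0].T
        # Fix B, solve min_A ||B A - M||_F.
        A = np.linalg.lstsq(B, M, rcond=None)[0]
    return B, A

# Synthetic target with a known spectrum (hypothetical sizes and values).
rng = np.random.default_rng(0)
m, n, r = 30, 20, 4
U, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal matrix
s = np.concatenate([[10.0, 8.0, 6.0, 5.0], 0.8 ** np.arange(n - r)])
M = (U * s) @ V.T  # singular values of M are exactly s

# Error of the best rank-r approximation: the discarded singular values.
err_svd = np.sqrt(np.sum(s[r:] ** 2))

A0 = rng.standard_normal((r, n))  # random warm start for the right factor
for k in (1, 5, 25):
    B, A = als_rank_r(M, A0, num_iters=k)
    gap = np.linalg.norm(B @ A - M) - err_svd
    print(f"{k:2d} ALS sweeps: excess error over truncated SVD = {gap:.2e}")
```

Each sweep multiplies the current subspace by $\mathbf{M}$ and $\mathbf{M}^\top$ and renormalizes through the least-squares solve, which is why the iteration behaves like a block power method; the speed of convergence depends on the gap between the $r$-th and $(r+1)$-th singular values.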

Paper Structure (47 sections, 2 theorems, 43 equations, 5 figures, 6 tables, 7 algorithms)


Key Result

Proposition 1

For any matrix $\bar{\mathbf{W}}_{t+1} := \mathbf{W}_t + \Delta$, the Frobenius-norm best rank-$r$ approximation $\Pi_r(\bar{\mathbf{W}}_{t+1}) \in \mathop{\mathrm{arg\,min}}\limits_{\mathop{\mathrm{rank}}\nolimits(\mathbf{W})\le r}\|\mathbf{W}-\bar{\mathbf{W}}_{t+1}\|_\mathrm{F}$ is given by the ra
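
The statement above is cut off in this excerpt. Assuming it refers to the standard Eckart-Young-Mirsky result, a Frobenius-norm best rank-$r$ approximation can be obtained by keeping the top $r$ singular triplets of $\bar{\mathbf{W}}_{t+1}$. The sketch below checks this numerically; the helper name `project_rank_r` and the matrix sizes are hypothetical, and nothing here is taken from the paper's algorithms.

```python
import numpy as np

def project_rank_r(W_bar, r):
    """Rank-r truncated SVD of W_bar (a Frobenius-norm best rank-r approximation)."""
    U, s, Vt = np.linalg.svd(W_bar, full_matrices=False)
    # Keep the r largest singular values and their singular vectors.
    return (U[:, :r] * s[:r]) @ Vt[:r]

rng = np.random.default_rng(1)
W_t = rng.standard_normal((12, 8))           # stand-in for current weights
Delta = 0.1 * rng.standard_normal((12, 8))   # stand-in for a full step
W_bar = W_t + Delta
r = 3

P = project_rank_r(W_bar, r)
sing_vals = np.linalg.svd(W_bar, compute_uv=False)

# The projection error equals the norm of the discarded singular values.
err_proj = np.linalg.norm(W_bar - P)
err_tail = np.sqrt(np.sum(sing_vals[r:] ** 2))

# Any other rank-r matrix, e.g. a random one, does no better.
candidate = rng.standard_normal((12, r)) @ rng.standard_normal((r, 8))
err_candidate = np.linalg.norm(W_bar - candidate)

print(err_proj, err_tail, err_candidate)
```

Computing this projection requires a full SVD of a dense matrix at every step, which is the cost the abstract says the proposed method avoids.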

Figures (5)

  • Figure 1: Linear task with full-batch gradients and no momentum (left) vs. mini-batch gradients with momentum (right). PSI-LoRA $\times K$ means $K$ alternating iterations.
  • Figure 2: Perplexity of GPT-2 fine-tuning on WikiText-103 with LoRA adapters for 3 learning rates per method. Scaled PSI-LoRA is more robust to learning rate choice and achieves better perplexity for all learning rates.
  • Figure 3: Ablation of Scaled PSI-LoRA on GLUE-MNLI.
  • Figure 4: Learning curves of validation main metrics for RoBERTa-base on GLUE tasks. Legend shows learning rates for each method.
  • Figure 5: Stochastic linear task: increasing the rank budget of the low-rank momentum buffer improves the approximation quality and moves PSI-LoRA closer to the SVDLoRA oracle.

Theorems & Definitions (5)

  • Proposition 1: SVDLoRA projection of a full step
  • Theorem 3.1: LoRSum as a warm-started subspace iteration
  • Proof
  • Remark 2.1
  • Remark 2.2