GeoLoRA: Geometric integration for parameter efficient fine-tuning

Steffen Schotthöfer, Emanuele Zangrando, Gianluca Ceruti, Francesco Tudisco, Jonas Kusch

TL;DR

GeoLoRA is a novel parameter-efficient fine-tuning approach that addresses limitations of LoRA by leveraging dynamical low-rank approximation theory, and it outperforms existing methods in both accuracy and computational efficiency.

Abstract

Low-Rank Adaptation (LoRA) has become a widely used method for parameter-efficient fine-tuning of large-scale, pre-trained neural networks. However, LoRA and its extensions face several challenges, including the need for rank adaptivity, robustness, and computational efficiency during the fine-tuning process. We introduce GeoLoRA, a novel approach that addresses these limitations by leveraging dynamical low-rank approximation theory. GeoLoRA requires only a single backpropagation pass over the small-rank adapters, significantly reducing computational cost as compared to similar dynamical low-rank training methods and making it faster than popular baselines such as AdaLoRA. This allows GeoLoRA to efficiently adapt the allocated parameter budget across the model, achieving smaller low-rank adapters compared to heuristic methods like AdaLoRA and LoRA, while maintaining critical convergence, descent, and error-bound theoretical guarantees. The resulting method is not only more efficient but also more robust to varying hyperparameter settings. We demonstrate the effectiveness of GeoLoRA on several state-of-the-art benchmarks, showing that it outperforms existing methods in both accuracy and computational efficiency.

Paper Structure

This paper contains 34 sections, 11 theorems, 79 equations, 5 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Algorithm alg_efficient_TDLRT with stochastic (mini-batch) gradients fulfills a stochastic descent estimate, where $W^{r}_{t}$, $\widehat{W}^{r}_{t}$, and $W^{r}_{t+1}$ are the low-rank weight matrices at the start of iteration $t+1$, before the truncation step, and after the truncation step, respectively.
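To make the "before and after the truncation step" wording concrete, here is a minimal sketch of one common way to truncate a factored adapter $W = U S V^\top$. It rests on assumptions not stated in the text above: that $U$ and $V$ have orthonormal columns, that truncation is done through an SVD of the small coefficient matrix $S$, and that the tolerance is relative, $\vartheta = \tau \|S\|_F$. The function name `truncate` is hypothetical, and the paper's actual rule may differ.

```python
import numpy as np

def truncate(U, S, V, tau):
    """Truncate W = U S V^T via an SVD of the small matrix S (assumed rule)."""
    P, sigma, Qt = np.linalg.svd(S)
    theta = tau * np.linalg.norm(sigma)  # assumed relative tolerance
    # tail[k] is the Frobenius norm of the singular values sigma[k:]
    tail = np.sqrt(np.cumsum(sigma[::-1] ** 2))[::-1]
    # keep the smallest rank whose discarded tail is at most theta
    keep = max(1, int(np.sum(tail > theta)))
    return U @ P[:, :keep], np.diag(sigma[:keep]), V @ Qt[:keep, :].T

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 4)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((6, 4)))
S = np.diag([3.0, 1.0, 1e-3, 1e-4])               # two negligible singular values
tau = 0.1

U1, S1, V1 = truncate(U, S, V, tau)
# Frobenius distance between the adapter before and after truncation
err = np.linalg.norm(U @ S @ V.T - U1 @ S1 @ V1.T)
```

Because $U$ and $V$ are orthonormal, the distance between the adapter before and after truncation equals the norm of the discarded singular values, which this rule keeps below $\vartheta$ by construction.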

Figures (5)

  • Figure 1: Illustration of simultaneous vs. Riemannian gradient flow. The simultaneous gradient flow converges to a point $W_{\star}$ such that $\widehat{P}(W_{\star})\nabla\mathcal{L} = 0$. Since $\widehat{P}$ is not an orthogonal projection, the gradient is not orthogonal to the tangent plane, i.e., $W_{\star}$ is suboptimal. For Riemannian gradient flows, the adapter converges to a point $W_{\star}$ such that $P(W_{\star})\nabla\mathcal{L} = 0$. Since $P$ is the orthogonal projection onto the tangent space, $W_{\star}$ is a local optimum, i.e., no direction in the tangent space $\mathcal{T}_{W_{\star}}\mathcal{M}$ further decreases the loss. Here, $\mathcal{M}$ denotes the space of low-rank adapters, and $\mathcal{T}_{W_{\star}}\mathcal{M}$ represents the tangent space at the optimal adapter weight $W_{\star}$.
  • Figure 2: Top panels (a, b): GeoLoRA-adapted ViT-32b fine-tuned on Cifar10 with different initial layer ranks, using a learning rate of $1\mathrm{e}{-3}$ and $\tau=0.3$. The total number of trainable parameters converges to a similar steady state, regardless of the initial rank. The differences in validation accuracy between runs are smaller than the variance observed within individual setups. Bottom panels (c, d): GeoLoRA- and AdaLoRA-adapted ViT-32b fine-tuned on Cifar10 with different rank budgets and learning rates. Fields marked with nan indicate that training diverged within the first epoch. GeoLoRA demonstrates significantly greater robustness than AdaLoRA, particularly with high learning rates.
  • Figure 3: Rank distribution of ViT-32b fine-tuned on Cifar10 for 5 epochs at learning rate $1\mathrm{e}{-3}$ using GeoLoRA and AdaLoRA.
  • Figure 4: Rank distribution of ViT-32b fine-tuned on Cifar10 for 5 epochs at learning rate $1\mathrm{e}{-4}$ using GeoLoRA and AdaLoRA.
  • Figure 5: Time trace of the matrix elements of SVD-LoRA (a), AdaLoRA (b), and the proposed method GeoLoRA (c) to solve \ref{eq_matrix_regression}. SVD-LoRA was trained with learning rate $\lambda=0.00178$, the largest learning rate for which the optimization remained stable; GeoLoRA allows larger learning rates, and $\lambda=0.1$ is used here. GeoLoRA converges quickly to single-precision accuracy, whereas SVD-LoRA still has a loss value of $1.7$ after $1000$ iterations because of the heavy oscillations in its $S$ matrix trajectory (a). AdaLoRA reduces the oscillations but incorrectly identifies the rank and fails to converge due to the influence of the additional singular values.
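The contrast in Figure 1 between an orthogonal projection $P$ and a non-orthogonal map $\widehat{P}$ can be checked numerically. The sketch below uses the standard orthogonal projector onto the tangent space of the rank-$r$ manifold at $W = U S V^\top$, namely $P(W)Z = UU^\top Z + ZVV^\top - UU^\top Z VV^\top$. For $\widehat{P}$ it uses the direction obtained by applying the chain rule to plain gradient flow on the three factors, $\widehat{P}(W)Z = Z V S^\top S V^\top + UU^\top Z VV^\top + U S S^\top U^\top Z$. That formula is an assumption made here for illustration and may not match the paper's exact definition. Both function names are hypothetical.

```python
import numpy as np

def tangent_projection(U, V, Z):
    """Orthogonal projection of Z onto the tangent space at W = U S V^T."""
    UtZ = U.T @ Z
    return U @ UtZ + (Z @ V) @ V.T - U @ (UtZ @ V) @ V.T

def simultaneous_direction(U, S, V, Z):
    """Assumed map induced by simultaneous gradient flow on U, S and V."""
    return (Z @ V @ S.T @ S @ V.T
            + U @ (U.T @ Z @ V) @ V.T
            + U @ S @ S.T @ (U.T @ Z))

rng = np.random.default_rng(0)
n, m, r = 8, 6, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
S = np.diag([2.0, 0.5])
Z = rng.standard_normal((n, m))                   # stand-in for a gradient

PZ = tangent_projection(U, V, Z)
PhatZ = simultaneous_direction(U, S, V, Z)

# An orthogonal projector is idempotent and leaves a residual orthogonal to its range.
p_idempotent = np.allclose(tangent_projection(U, V, PZ), PZ)
residual_inner = float(np.sum((Z - PZ) * PZ))
# The simultaneous-flow map rescales components by the entries of S, so it is not idempotent.
phat_idempotent = np.allclose(simultaneous_direction(U, S, V, PhatZ), PhatZ)
```

In this sketch, applying $P$ twice gives the same result as applying it once, and the residual $Z - PZ$ is orthogonal to $PZ$. The assumed $\widehat{P}$ fails the idempotence check, which is consistent with the caption's statement that $\widehat{P}$ is not an orthogonal projection.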

Theorems & Definitions (17)

  • Theorem 1: Stochastic descent estimate
  • Theorem 2: Convergence
  • Theorem 3: Error-bound
  • Proposition 1: Global structure preservation
  • Lemma 1
  • Theorem 4
  • proof
  • Theorem 5
  • proof
  • Theorem 6
  • ...and 7 more