Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment

Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xiaoye Qu, Wei Wei, Yu Cheng

TL;DR

GOAT introduces an adaptive SVD-structured Mixture-of-Experts framework to boost LoRA fine-tuning of large models. By initializing each MoE expert with a distinct segment of singular values and applying a derived scaling factor to align LoRA gradients with full fine-tuning, GOAT narrows the performance gap without changing the architecture or training algorithm. Theoretical results on initialization and gradient alignment underpin the method, and experiments across 25 datasets spanning CV and NLP tasks show state-of-the-art performance and favorable efficiency. GOAT's adaptive priors, gradient scaling, and MoE routing yield faster convergence and robust gains, making parameter-efficient fine-tuning more competitive with Full FT baselines in practice.

Abstract

While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, which makes suboptimal use of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to bring SVD priors into the LoRA MoE architecture. To mitigate these issues, we propose Great LoRA Mixture-of-Expert (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.

Paper Structure

This paper contains 66 sections, 13 theorems, 65 equations, 7 figures, 15 tables, 1 algorithm.

Key Result

Lemma 2.2

Let $g_t$ be the gradient in full-tuning, and $B$, $A$ be the low-rank weights. At the $t$-th optimization step, the equivalent gradient can be expressed as:
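The statement is cut off before its equation. As a hedged sketch rather than the paper's exact wording, assume the reparameterization $W = W_0 + sBA$ from Figure 3, plain gradient descent with learning rate $\eta$ (an assumption; the paper may use a different optimizer), and write $\tilde{g}_t$ for the equivalent gradient (notation introduced here). A first-order expansion of one update step then gives:

$$
\begin{aligned}
\frac{\partial L}{\partial B} &= s\, g_t A^\top, \qquad \frac{\partial L}{\partial A} = s\, B^\top g_t, \\
s\,(B - \eta\, s\, g_t A^\top)(A - \eta\, s\, B^\top g_t) - sBA &= -\eta\, s^2 \left( B B^\top g_t + g_t A^\top A \right) + O(\eta^2), \\
\tilde{g}_t &= s^2 \left( B B^\top g_t + g_t A^\top A \right).
\end{aligned}
$$

Under these assumptions, the low-rank update acts on the full weight like a projected and rescaled copy of $g_t$, which is why the choice of $s$ and of the initial $B$, $A$ affects how closely LoRA tracks full fine-tuning.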

Figures (7)

  • Figure 1: The effect of initializations from different SVD segments $(u_i, \sigma_i, v_i^\top)$ for rank 32 and 128. Performance is normalized by min-max scaling.
  • Figure 2: SVD initialization vs. scaling $s$ and rank $r$
  • Figure 3: Illustration of our method. Single Low-Rank Adaptation: LoRA reduces trainable parameters by reparameterizing $W$ as $W = W_0 + sBA$, with $B$ and $A$ as low-rank matrices. MoE Fine-tuning: Full MoE fine-tuning, where experts $W^1$ and $W^E$ are the ones currently selected by the router. Subfigure (I): Our method replaces the single pair $B, A$ with multiple pairs $\{B^i, A^i\}_{i=1}^E$, initialized from different segments of the SVD of $W_0$ and adaptively selected by the router. Subfigure (II): We align optimization with SVD-structured MoE by separately aligning each expert. $W_{\text{res}}$ ensures the equivalent weight equals $W_0$ before optimization, and we scale each expert's equivalent gradient to closely approximate full MoE fine-tuning.
  • Figure 4: Training loss curves of different LoRA methods and Full Fine-tuning MoE on Cars. The balance loss is excluded in the MoE baselines for a fair comparison with single LoRA baselines.
  • Figure 5: Performance of different methods across ranks.
  • ...and 2 more figures
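
The initialization described in the Figure 3 caption can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `init_svd_experts`, the evenly spaced segment offsets, the square-root split of singular values between $B^i$ and $A^i$, and the uniform routing weight $1/E$ used to form $W_{\text{res}}$ are all choices made here for illustration.

```python
import numpy as np

def init_svd_experts(W0, num_experts, rank, scale):
    """Hypothetical helper: build E low-rank pairs from SVD segments of W0."""
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    k = S.shape[0]
    # Assumption: segment start offsets are evenly spaced over the spectrum.
    starts = np.linspace(0, k - rank, num_experts).astype(int)
    experts = []
    for start in starts:
        seg = slice(start, start + rank)
        root = np.sqrt(S[seg])
        B = U[:, seg] * root           # shape (m, r)
        A = root[:, None] * Vt[seg]    # shape (r, n)
        experts.append((B, A))
    # Assumption: uniform routing weights 1/E before training.
    # W_res absorbs the experts so the equivalent weight starts at W0.
    W_res = W0 - (scale / num_experts) * sum(B @ A for B, A in experts)
    return experts, W_res

# Usage: four rank-2 experts on a random 16 x 12 weight.
rng = np.random.default_rng(0)
W0 = rng.standard_normal((16, 12))
experts, W_res = init_svd_experts(W0, num_experts=4, rank=2, scale=2.0)
W_equiv = W_res + (2.0 / 4) * sum(B @ A for B, A in experts)
```

With these choices, `W_equiv` reproduces `W0` up to floating-point error, which mirrors the role the caption assigns to $W_{\text{res}}$. With an input-dependent router the effective weight varies per token, so the uniform-weight residual is only a simplification.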

Theorems & Definitions (21)

  • Definition 2.1: Equivalent Weight and Gradient
  • Lemma 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Lemma 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Lemma 2.1
  • proof
  • Lemma 2.2
  • ...and 11 more