A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models

Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang

TL;DR

Inspired by Riemannian preconditioners, which train LoRA as a subspace projector, this work proposes a new training strategy for MoE-LoRA that stabilizes and improves its feature learning through multi-space projections.

Abstract

To streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been widely adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA is to decompose a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates training. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Experts (MoE) architecture has been introduced to combine multiple LoRA adapters. Integrating LoRA experts leads to a visible improvement across several downstream tasks. However, the mixture of LoRAs (MoE-LoRA) still exhibits low robustness during tuning and inference. Inspired by Riemannian preconditioners, which train LoRA as a subspace projector, we propose a new training strategy for MoE-LoRA that stabilizes and improves its feature learning through multi-space projections. Experiments with the SGD and AdamW optimizers demonstrate the effectiveness of our methodology. Source code is available at https://github.com/THUDM/MoELoRA_Riemannian.
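
The storage argument for the low-rank decomposition can be made concrete with a small parameter count. The sketch below is illustrative only: the function name `lora_param_counts`, the layer size 4096 x 4096, and the rank 8 are assumptions chosen for the example and are not taken from the paper.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, low_rank) trainable-parameter counts for one weight matrix.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    full = d_out * d_in               # dense update of shape (d_out, d_in)
    low_rank = rank * (d_out + d_in)  # B: (d_out, rank) plus A: (rank, d_in)
    return full, low_rank


# Assumed example sizes: a 4096 x 4096 layer adapted with rank 8.
full, low = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, low, full // low)
```

Under these assumed sizes, the two low-rank factors hold 256 times fewer trainable parameters than a dense update of the same layer.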

Paper Structure

This paper contains 22 sections, 11 equations, 3 figures, 10 tables, 1 algorithm.

Figures (3)

  • Figure 1: The overall MoE-LoRA architecture and an insight into its gradient update process. The left part of the figure shows the mixture-of-LoRAs pipeline, which freezes the pretrained FFN weights and trains a series of LoRA adapters together with a routing gate. The right part shows how MoE-LoRA is updated. Specifically, we plot an example of a 2-expert MoE-LoRA in the case $g_1<g_2$, which results in a more distorted manifold $g_1B_1A_1$. For ease of display, we omit the fixed pretrained weights and suppose $X=g_1E_1+g_2E_2$. Accordingly, for an arbitrary step $t$ we plot the state point $\frac{1}{2}X^{(t)}$, which equals $\frac{{g_1}^{(t)}{B_1}^{(t)}{A_1}^{(t)}+{g_2}^{(t)}{B_2}^{(t)}{A_2}^{(t)}}{2}$ and therefore serves as the midpoint of the two manifold states at $t$. The figure illustrates that $g_1B_1A_1$ has a higher curvature, so its locally optimal descent direction and the projection of the globally optimal descent direction differ more. This indicates the need for gate-related preconditioners.
  • Figure 2: Convergence of $RSGD_{20,10,4}$ and $gRSGD_{20,10,4}$ MoE-LoRA with Llama-3.2-3B as the foundation model. We plot training and evaluation losses, as well as accuracy, for the first 500 steps.
  • Figure 3: ScienceQA training-loss curves under conventional SGD and Riemannian-preconditioned SGD, each also shown with the gate-based rescaling approach integrated. Llama-3.2-3B serves as the foundation model.
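
The mixture described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`matmul`, `scale`, `add`, `softmax`, `moe_lora_delta`), the softmax gate, the toy rank-1 experts, and the identity matrix standing in for the frozen weights are all hypothetical, and each $E_i$ is taken to be $B_iA_i$.

```python
import math


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def scale(c, M):
    """Multiply every entry of matrix M by the scalar c."""
    return [[c * v for v in row] for row in M]


def add(M, N):
    """Add two matrices of the same shape entry by entry."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(M, N)]


def softmax(logits):
    """Numerically stable softmax (assumed gate; the caption only says 'routing gate')."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_lora_delta(gates, experts):
    """Return sum_i g_i * B_i A_i for experts given as (B_i, A_i) pairs."""
    B0, A0 = experts[0]
    delta = [[0.0] * len(A0[0]) for _ in B0]
    for g, (B, A) in zip(gates, experts):
        delta = add(delta, scale(g, matmul(B, A)))
    return delta


# Two toy rank-1 experts on a 2 x 2 layer (values chosen only for illustration).
B1, A1 = [[1.0], [0.0]], [[1.0, 2.0]]
B2, A2 = [[0.0], [1.0]], [[3.0, 4.0]]

# Router logits chosen so that g1 < g2, matching the case drawn in Figure 1.
g1, g2 = softmax([0.0, math.log(3.0)])

X = moe_lora_delta([g1, g2], [(B1, A1), (B2, A2)])  # X = g1*B1A1 + g2*B2A2
W = [[1.0, 0.0], [0.0, 1.0]]                        # stand-in for frozen weights
W_eff = add(W, X)                                   # effective weight of the layer

# The plotted state point X/2 is the midpoint of the two gated expert states.
half_X = scale(0.5, X)
midpoint = scale(0.5, add(scale(g1, matmul(B1, A1)), scale(g2, matmul(B2, A2))))
```

Here `half_X` and `midpoint` coincide by construction, which is the identity the caption uses to place $\frac{1}{2}X^{(t)}$ between the two manifold states. The sketch covers only the forward mixture; it does not implement the Riemannian preconditioning or the gate-based rescaling.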