MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang

TL;DR

MixLoRA presents a resource-efficient sparse MoE built from multiple LoRA-based experts that share a frozen FFN, augmented with attention-layer LoRA adapters and an auxiliary load-balance loss. A top-k router dynamically routes tokens to experts, enabling better multi-task generalization while keeping GPU memory and computation in check. A high-throughput framework further reduces token latency and memory usage, making MixLoRA viable on consumer-grade GPUs. Empirical results show notable improvements over LoRA and DoRA in both single-task and multi-task settings across diverse commonsense reasoning benchmarks. The work also provides extensive ablations and optimization strategies to boost efficiency and robustness across model sizes.

Abstract

Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task learning scenarios while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with less than 24 GB of memory. To tackle these challenges, we propose MixLoRA, an approach to construct a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and employs a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by utilizing independent attention-layer LoRA adapters. Additionally, an auxiliary load balance loss is employed to address the imbalance problem of the router. Our evaluations show that MixLoRA improves accuracy by about 9% compared to state-of-the-art PEFT methods in multi-task learning scenarios. We also propose a new high-throughput framework to alleviate the computation and memory bottlenecks during the training and inference of MoE models. This framework reduces GPU memory consumption by 40% and token computation latency by 30% during both training and inference.
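
The routing idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes a single frozen weight matrix W in place of the full feed-forward block, and a Switch-Transformer-style auxiliary loss (number of experts times the sum of dispatch fraction times mean router probability), since the abstract does not state the formula. All function names are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Pick the k most probable experts and renormalize their weights."""
    probs = softmax(logits)
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in idx)
    return idx, [probs[i] / total for i in idx], probs

def lora_expert(x, W, A, B, scale):
    """Frozen W plus a low-rank update: W x + scale * B (A x)."""
    base = matvec(W, x)                 # frozen, shared by all experts
    delta = matvec(B, matvec(A, x))     # trainable, specific to one expert
    return [b + scale * d for b, d in zip(base, delta)]

def mixlora_like_layer(x, W, experts, router, k, scale=1.0):
    """Route one token to k LoRA experts and mix their outputs.

    Assumption: `experts` is a list of (A, B) pairs and `router` is a
    matrix with one row of logit weights per expert.
    """
    idx, weights, probs = top_k_route(matvec(router, x), k)
    out = [0.0] * len(W)
    for i, w in zip(idx, weights):
        A, B = experts[i]
        y = lora_expert(x, W, A, B, scale)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, idx, probs

def load_balance_loss(routed, probs, n_experts):
    """Assumed form: N * sum_i f_i * P_i.

    f_i is the fraction of routing slots sent to expert i and P_i is the
    mean router probability of expert i over all tokens.
    """
    n_slots = sum(len(r) for r in routed)
    frac = [sum(r.count(i) for r in routed) / n_slots for i in range(n_experts)]
    mean_p = [sum(p[i] for p in probs) / len(probs) for i in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac, mean_p))

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 2.0]]
    experts = [([[0.5, 0.5]], [[0.1], [-0.1]]) for _ in range(3)]
    router = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    out, chosen, probs = mixlora_like_layer([1.0, 3.0], W, experts, router, k=2)
    print(out, chosen, load_balance_loss([chosen], [probs], 3))
```

Under this assumed loss, perfectly uniform routing yields a value of 1, and the value grows as tokens concentrate on a few experts, which is why adding it to the training objective discourages router collapse.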

Paper Structure (23 sections, 9 equations, 12 figures, 9 tables, 1 algorithm)

Figures (12)

  • Figure 1: The timeline of public LoRA-MoE methods' release dates, including the detailed model information on the position of integration, how to train with the LoRA-MoE method (router and load balance), and the problems they aim to solve.
  • Figure 2: The architecture of MixLoRA transformer block. MixLoRA consists of n experts formed by an original FFN sublayer combined with different LoRAs, where the weights of the FFN sublayer are shared among all experts.
  • Figure 3: Comparison of the forward propagation processes: (a) the process in a vanilla MixLoRA MoE block; (b) the optimized process that shares computation results of $W_1$ and $W_3$ to reduce computational complexity.
  • Figure 4: Ablation studies on router loss coefficient (a) and rank (b) on LLaMA2 7B. (c) MixLoRA outperforms LoRA and DoRA without introducing significant latency on diverse commonsense tasks.
  • Figure : (a)
  • ...and 7 more figures
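
The computation sharing described in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's framework. It assumes a LLaMA-style gated FFN, W2(SiLU(W1 x) * W3 x), with one LoRA pair per projection in each expert. The caption only names $W_1$ and $W_3$, so the activation, the LoRA placement, and every function name here are assumptions. The sketch handles a single token and does not model batching or routing.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def silu(v):
    return [x / (1.0 + math.exp(-x)) for x in v]

def lora_delta(x, A, B):
    """Low-rank update B (A x) for one projection."""
    return matvec(B, matvec(A, x))

def expert_output(x, base1, base3, W2, lora):
    """Finish one expert given the frozen projections W1 x and W3 x."""
    h1 = add(base1, lora_delta(x, *lora["w1"]))
    h3 = add(base3, lora_delta(x, *lora["w3"]))
    h = [a * b for a, b in zip(silu(h1), h3)]
    # W2 acts on h, which differs per expert, so it cannot be shared.
    return add(matvec(W2, h), lora_delta(h, *lora["w2"]))

def expert_naive(x, W1, W2, W3, lora):
    """Vanilla path: every expert recomputes W1 x and W3 x."""
    return expert_output(x, matvec(W1, x), matvec(W3, x), W2, lora)

def experts_shared(x, W1, W2, W3, loras):
    """Optimized path: W1 x and W3 x are computed once and reused."""
    base1, base3 = matvec(W1, x), matvec(W3, x)
    return [expert_output(x, base1, base3, W2, lora) for lora in loras]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def make_lora(d, m, r, rng):
    """Hypothetical expert: (A, B) pairs for W1, W3 (m x d) and W2 (d x m)."""
    return {
        "w1": (rand_matrix(r, d, rng), rand_matrix(m, r, rng)),
        "w3": (rand_matrix(r, d, rng), rand_matrix(m, r, rng)),
        "w2": (rand_matrix(r, m, rng), rand_matrix(d, r, rng)),
    }

if __name__ == "__main__":
    rng = random.Random(0)
    d, m, r = 3, 4, 2
    W1, W3 = rand_matrix(m, d, rng), rand_matrix(m, d, rng)
    W2 = rand_matrix(d, m, rng)
    loras = [make_lora(d, m, r, rng) for _ in range(3)]
    x = [0.3, -0.2, 0.5]
    print(experts_shared(x, W1, W2, W3, loras))
```

With n experts, the vanilla path performs 2n full-size frozen projections of x while the shared path performs 2, and both produce the same expert outputs because the frozen terms are identical across experts.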