MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models

Tongxu Luo; Jiahe Lei; Fangyu Lei; Weihao Liu; Shizhu He; Jun Zhao; Kang Liu

TL;DR

MoELoRA is a parameter-efficient fine-tuning method that treats multiple LoRA modules as a Mixture of Experts and adds a contrastive loss that pushes experts to specialize, which mitigates random routing. With 8 LoRA-based experts, Top-2 routing, and a joint load-balancing and contrastive objective, it outperforms LoRA on math reasoning tasks and is competitive with GPT-3.5 on several benchmarks. The work offers a practical way to combine modular adapters dynamically during LLM fine-tuning, and it also highlights the remaining limits of PEFT on knowledge-heavy tasks.

Abstract

Fine-tuning is often necessary to adapt Large Language Models (LLMs) to downstream tasks. However, updating billions of parameters demands significant computational resources and training time, which is a substantial obstacle to applying large-scale models widely. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) has emerged as a prominent paradigm in recent research. Yet current PEFT approaches that employ a limited set of global parameters (such as LoRA, which adds low-rank approximation matrices to all weights) struggle to combine different computational modules flexibly in downstream tasks. In this work, we introduce a novel PEFT method: MoELoRA. We treat LoRA modules as a Mixture of Experts (MoE), and to mitigate the random routing phenomenon observed in MoE, we propose using contrastive learning to encourage experts to learn distinct features. We conducted experiments on 11 tasks from math reasoning and common-sense reasoning benchmarks. With the same number of parameters, our approach significantly outperforms LoRA. In math reasoning, MoELoRA achieved an average performance 4.2% higher than LoRA and was competitive with the 175B GPT-3.5 on several benchmarks.
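The routing idea in the abstract can be illustrated with a minimal sketch for a single token vector. It assumes a softmax gate, Top-2 selection among 8 experts, and renormalization of the gate weights over the selected experts; the names `moelora_forward`, `W_gate`, and `rand_matrix`, the toy dimensions, and the random initialization are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: small Gaussian matrix for the toy example."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def moelora_forward(x, W0, experts, W_gate, top_k=2):
    """Frozen base projection plus a gated sum of the top-k LoRA experts.

    experts is a list of (A, B) pairs with A of shape r x d_in and
    B of shape d_out x r. Returns the output vector and the chosen indices.
    """
    gate = softmax(matvec(W_gate, x))          # one score per expert
    order = sorted(range(len(experts)), key=lambda i: gate[i], reverse=True)
    chosen = order[:top_k]
    norm = sum(gate[i] for i in chosen)        # assumption: renormalize over chosen experts
    y = matvec(W0, x)                          # frozen pretrained weight
    for i in chosen:
        A, B = experts[i]
        delta = matvec(B, matvec(A, x))        # low-rank update B_i A_i x
        w = gate[i] / norm
        y = [yi + w * di for yi, di in zip(y, delta)]
    return y, chosen

rng = random.Random(0)
d_in, d_out, r, n_experts = 6, 4, 2, 8
W0 = rand_matrix(d_out, d_in, rng)
# Both A and B are random here so the experts have a visible effect;
# LoRA normally starts B at zero.
experts = [(rand_matrix(r, d_in, rng), rand_matrix(d_out, r, rng))
           for _ in range(n_experts)]
W_gate = rand_matrix(n_experts, d_in, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
y, chosen = moelora_forward(x, W0, experts, W_gate)
print(len(y), chosen)
```

Only the two selected experts contribute to the output for a given token, so the per-token cost of the adapter path stays close to that of two LoRA modules even though eight are available.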

Paper Structure (27 sections, 11 equations, 5 figures, 8 tables)

Introduction
Related Work
Parameter-Efficient Fine-Tuning
Mixture-of-Experts
Contrastive Learning
The Proposed Method
Framework of MoELoRA
Challenge of MoELoRA
Load Imbalance
Random Routing
Auxiliary loss
Load Balancing Loss
Experts Contrastive Loss
Experiments
Experimental Setup
...and 12 more sections

Figures (5)

Figure 1: Architectures for (a) Fine-Tuning, (b) LoRA, and (c) the proposed MoELoRA. $\Delta W$ denotes the gradient increment for the downstream tasks. LoRA decomposes $\Delta W$ into two matrices $A$ and $B$, and the proposed MoELoRA can select the $A_i$ and $B_i$ corresponding to a specific task for better adaptation. To differentiate the capabilities of the experts, contrastive learning is applied to the experts' outputs.
Figure 2: The process of calculating the Experts Contrastive Loss. The example uses a sentence input $h \in \mathbb{R}^{T \times d}$, where each token selects its top 2 experts. First, each expert updates its own queue with the tokens that selected it. The Contrastive Loss is then computed from the samples in these queues.
Figure 3: Routing of all numeric tokens, which are often assigned to specific experts.
Figure 4: Routing of the numeric token '2'.
Figure 5: Routing of the numeric token '4'.
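
The queue-based computation described in the Figure 2 caption can be sketched as follows. This is an InfoNCE-style illustration under stated assumptions, not the paper's exact loss: samples from the same expert's queue are treated as positives, samples from all other queues as negatives, similarity is cosine, and the function name `experts_contrastive_loss` and the `temperature` value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def experts_contrastive_loss(queues, temperature=0.1):
    """InfoNCE-style loss over per-expert queues of output vectors.

    queues[i] holds the outputs expert i produced for tokens routed to it.
    For each anchor, samples in the same queue are positives and samples
    in every other queue are negatives (an assumption of this sketch).
    """
    samples = [(i, v) for i, q in enumerate(queues) for v in q]
    total, count = 0.0, 0
    for a, (ea, va) in enumerate(samples):
        sims = [(eb, math.exp(cosine(va, vb) / temperature))
                for b, (eb, vb) in enumerate(samples) if b != a]
        pos = sum(s for eb, s in sims if eb == ea)
        denom = sum(s for _, s in sims)
        if pos > 0.0:                  # skip anchors with no positive sample
            total += -math.log(pos / denom)
            count += 1
    return total / count if count else 0.0

# Two experts whose outputs are well separated ...
separated = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
# ... versus two experts whose queues contain the same kinds of vectors.
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]
loss_separated = experts_contrastive_loss(separated)
loss_mixed = experts_contrastive_loss(mixed)
print(loss_separated < loss_mixed)
```

The loss is low when each expert's outputs cluster together and away from other experts' outputs, which is the specialization pressure the captions describe.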
