Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts

Rhui Dih Lee; Laura Wynter; Raghu Kiran Ganti

TL;DR

A toolkit for creating a Mixture-of-Domain-Experts (MOE) from trained models or from adapters, with guidance on defining the architecture of the resulting MOE.

Abstract

We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) models from trained models. The toolkit can create a mixture from models or from adapters. We perform extensive tests and offer guidance on defining the architecture of the resulting MOE using the toolkit. A public repository is available.

Paper Structure (12 sections, 7 figures, 2 tables)

This paper contains 12 sections, 7 figures, 2 tables.

Introduction
Related work
Augmenting an LLM with other expert LLMs
Experimental results
Low-cost MOE creation is a viable approach
Router training can be beneficial but is not required
Ablation study on llama3-8B
MOE base model has an impact on MOE performance quality
FFN mixing is best overall but LoRA adapter MOE mixing is competitive
Fine-grained router training can be beneficial but training is not required in general
Mixed MOEs can perform better than the baselines and constituent experts
Conclusions

Figures (7)

Figure 1: Illustration of an MOE model swapping its FFN with a set of FFN layers and a router
Figure 2: Main results of MOE creation on Merlinite with 2 and 4 experts. The first four bars in each section show evaluations of the trained expert models individually. The next three grey bars are the 2x MOE, and the three bars in shades of yellow are the 4x MOE. Overall, the MOEs are all very competitive, with or without router training and with both 2 and 4 experts. Router training is primarily advantageous on the math tasks, GSM8K-COT in particular.
Figure 3: Illustration of training loss of the Merlinite 4X MOE models with 3 variants of the training paradigm: instruction tuning of the router and router with embedding layer, and extended pre-training of the router. On the right side we show evaluation results using the checkpoints after 1 and 2 epochs of training. The Noisy MOE (no training) evaluation result is the solid red line while the best result of the 4 experts alone is the dashed red line. We see that router training offers benefit on the math tasks.
Figure 4: Illustration of training loss of the Merlinite 2X MOE models with 3 variants of the training paradigm: instruction tuning of the router and router with embedding layer, and extended pre-training of the router. On the right side we show evaluation results using the checkpoints after 1 and 2 epochs of training. The Noisy MOE (no training) evaluation result is the solid red line while the best result of the 4 experts alone is the dashed red line. In the 2x case, router training offers benefit on several of the tasks, not only math tasks.
Figure 5: Ablation study using several variants of our proposed methodology on llama3-8B with the relevant baselines. We evaluate the MOE with different base models, a single router per FFN layer vs. 'fgmlp' having 3 routers per FFN layer, noisy MOE, tuned router, tuned router with tuned embedding layers, and routers on FFN and attention modules. As baselines, we evaluate llama3-8B, instruct-tuned llama3-8B, the best fine-tuned llama3-8B for each task and the best LoRA-adapter-tuned llama3-8B for each task.
...and 2 more figures
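
The architecture described in the Figure 1 caption, an FFN replaced by a set of expert FFN layers plus a router, can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the toolkit's implementation: the names `make_ffn` and `moe_ffn`, the top-k softmax gating, the ReLU FFN shape, and the tiny dimensions are all assumptions made here. The randomly initialised router weights merely stand in for a router that has received no training.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_ffn(d_model, d_hidden, rng):
    """Hypothetical stand-in for one expert's FFN: linear, ReLU, linear."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def ffn(x):
        h = [max(0.0, v) for v in matvec(W1, x)]
        return matvec(W2, h)

    return ffn


def moe_ffn(x, experts, router_W, top_k=2):
    """Route one token vector x to its top_k experts and mix their outputs.

    router_W has one row per expert; the gate weights are a softmax over
    the logits of the selected experts only (an assumed gating scheme).
    """
    logits = matvec(router_W, x)
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, dict(zip(top, gates))


rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 4
experts = [make_ffn(d_model, d_hidden, rng) for _ in range(n_experts)]
# Untrained router: random weights, one row per expert (assumption).
router_W = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_experts)]

x = [0.5, -1.0, 0.25, 2.0]
y, gates = moe_ffn(x, experts, router_W, top_k=2)
```

In this sketch the expert FFNs are left untouched and only `router_W` would be updated if the router were trained, which is one way to read the "router training" variants named in the captions above.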
