Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts

Rhui Dih Lee, Laura Wynter, Raghu Kiran Ganti

TL;DR

The toolkit creates a Mixture-of-Domain-Experts (MOE) from either trained models or adapters, and the paper offers guidance on defining the architecture of the resulting MOE.

Abstract

We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) from trained models. The toolkit can be used for creating a mixture from models or from adapters. We perform extensive tests and offer guidance on defining the architecture of the resulting MOE using the toolkit. A public repository is available.

Paper Structure (12 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Illustration of an MOE model swapping its FFN for a set of FFN layers and a router
  • Figure 2: Main results of MOE creation on Merlinite with 2 and 4 experts. The first four bars in each section show evaluations of the trained expert models individually. The next three grey bars are the 2x MOE, and the three yellow bars in different shades are the 4x MOE. Overall, the MOEs are all very competitive, with or without router training and with both 2 and 4 experts. Router training is primarily advantageous on the math tasks, GSM8K-COT in particular.
  • Figure 3: Illustration of training loss of the Merlinite 4X MOE models with 3 variants of the training paradigm: instruction tuning of the router, instruction tuning of the router with the embedding layer, and extended pre-training of the router. On the right side we show evaluation results using the checkpoints after 1 and 2 epochs of training. The Noisy MOE (no training) evaluation result is the solid red line, while the best result of the 4 experts alone is the dashed red line. We see that router training offers a benefit on the math tasks.
  • Figure 4: Illustration of training loss of the Merlinite 2X MOE models with 3 variants of the training paradigm: instruction tuning of the router, instruction tuning of the router with the embedding layer, and extended pre-training of the router. On the right side we show evaluation results using the checkpoints after 1 and 2 epochs of training. The Noisy MOE (no training) evaluation result is the solid red line, while the best result of the 4 experts alone is the dashed red line. In the 2x case, router training offers a benefit on several of the tasks, not only the math tasks.
  • Figure 5: Ablation study using several variants of our proposed methodology on llama3-8B with the relevant baselines. We evaluate the MOE with different base models, a single router per FFN layer vs. 'fgmlp' having 3 routers per FFN layer, noisy MOE, tuned router, tuned router with tuned embedding layers, and routers on FFN and attention modules. As baselines, we evaluate llama3-8B, instruct-tuned llama3-8B, the best fine-tuned llama3-8B for each task and the best LoRA-adapter-tuned llama3-8B for each task.
  • ...and 2 more figures
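
The structure described in the Figure 1 caption, an FFN replaced by a set of expert FFN layers plus a router, can be sketched in a few lines of plain Python. This is only an illustrative toy, not the toolkit's API: the names `make_ffn`, `moe_ffn`, and `random_matrix` are hypothetical, and top-k routing with softmax gating is an assumption (a common MOE choice) rather than something stated in the text above. The router here is randomly initialized and untrained.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_ffn(w_in, w_out):
    """Return a toy FFN: x -> w_out @ relu(w_in @ x)."""
    def ffn(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)
    return ffn


def moe_ffn(x, experts, router_weights, top_k=2):
    """Route x to the top_k experts and mix their outputs.

    Assumption: gates are a softmax over the selected router logits.
    Assumes every expert maps a vector of len(x) to a vector of len(x).
    """
    logits = matvec(router_weights, x)  # one logit per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top, gates


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden, n_experts = 4, 8, 4

# Each "expert" stands in for the FFN of a separately trained domain model.
experts = [
    make_ffn(random_matrix(hidden, dim, rng), random_matrix(dim, hidden, rng))
    for _ in range(n_experts)
]
# Untrained router: one weight row per expert.
router = random_matrix(n_experts, dim, rng)

x = [1.0, -0.5, 0.25, 2.0]
y, chosen, gates = moe_ffn(x, experts, router, top_k=2)
print("experts used:", chosen, "gates:", gates)
```

With `top_k=1` the layer reduces to the single selected expert's FFN, which is a quick sanity check that the mixing logic leaves expert outputs unchanged.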