Scalable Multi-Domain Adaptation of Language Models using Modular Experts

Peter Schafhalter, Shun Liao, Yanqi Zhou, Chih-Kuan Yeh, Arun Kandoor, James Laudon

TL;DR

Modular Domain Experts (MoDE) is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts. It achieves target performance comparable to full-parameter fine-tuning while achieving 1.65% better retention performance.

Abstract

Domain-specific adaptation is critical to maximizing the performance of pre-trained language models (PLMs) on one or multiple targeted tasks, especially in resource-constrained use cases such as edge devices. However, existing methods often struggle to balance domain-specific performance, retention of general knowledge, and efficiency for training and inference. To address these challenges, we propose Modular Domain Experts (MoDE). MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts. These experts are trained independently and composed together via a lightweight training process. In contrast to standard low-rank adaptation methods, each MoDE expert consists of several transformer layers, which scale better with more training examples and larger parameter counts. Our evaluation demonstrates that MoDE achieves target performance comparable to full-parameter fine-tuning while achieving 1.65% better retention performance. Moreover, MoDE's architecture enables flexible sharding configurations and improves training speeds by up to 38% over state-of-the-art distributed training configurations.

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: MoDE overview. MoDE models are divided into blocks, each containing transformer layers from the backbone or an expert (model architecture panel). Backbone and expert blocks operate on the same inputs. The model takes a linear combination of their outputs, where the weights are determined by a gating function. The training procedure panel outlines the training process: experts are trained independently on specific domains, while the backbone's parameters remain unchanged. For a multi-domain task, experts are modularly composed to enhance the model's performance. Here, the code and chat experts are combined to improve the performance of an interactive coding assistant. A lightweight fine-tuning process updates the experts and the gating function to improve performance on the target task.
  • Figure 2: Example SPMD and MPMD sharding configurations. The model-parallel sharding configuration enabled by SPMD evenly distributes all weights across all TPUs. In contrast, the provided MPMD sharding configuration executes the backbone and the expert on 2 different meshes of 2 TPUs each, which may reduce communication overheads.
  • Figure 3: Scalability of adaptation methods. We find that MoDE scales better than LoRA as the number of training examples and the number of added parameters increase. In the left figure, we increase the number of adapter parameters for LoRA by increasing the rank and for MoDE by increasing the number of expert layers, and find that MoDE provides higher accuracy than LoRA with more trainable parameters. In the right figure, we generate versions of the Code dataset with different numbers of training examples, and train a LoRA adapter and a MoDE expert for the same number of training steps on each. Although LoRA provides better accuracy on small datasets of up to $\sim$1k training examples, MoDE's accuracy is better on large datasets, demonstrating that MoDE scales better with more training data than LoRA.
  • Figure 4: Evaluation of flexible sharding configurations enabled by the MoDE model architecture.
  • Figure 5: Efficiency of mixture data. We examine how much mixture data is required for training. From left to right, we present the next-token accuracy on Code, Math, and English. The x-axis of each subplot is the number of examples from the Math + Code mixture data.
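
The gated combination described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The captions do not specify the form of the gating function, so the softmax over linear scores used here is an assumption, and the names `gate_weights`, `mode_block`, `backbone_block`, and `code_expert_block` are hypothetical. The toy branches stand in for groups of transformer layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(x, gate_matrix):
    """Assumed gating function: one linear score per branch, then softmax.

    x: input vector of length d.
    gate_matrix: one row of length d per branch (backbone first, then experts).
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_matrix]
    return softmax(scores)


def mode_block(x, branches, gate_matrix):
    """Return the gate-weighted linear combination of the branch outputs.

    Every branch receives the same input x, as the caption describes.
    """
    weights = gate_weights(x, gate_matrix)
    outputs = [branch(x) for branch in branches]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]


# Toy stand-ins for transformer layers (illustration only).
def backbone_block(x):
    return [2.0 * v for v in x]


def code_expert_block(x):
    return [v + 1.0 for v in x]


x = [1.0, -1.0]
gate_matrix = [[0.0, 0.0], [0.0, 0.0]]  # zero scores give equal weights
y = mode_block(x, [backbone_block, code_expert_block], gate_matrix)
print(y)
```

With all gate scores at zero, both branches receive weight 0.5, so the output is the plain average of the backbone and expert outputs. In the training process the caption describes, the backbone's parameters stay unchanged while the experts and the gating function are updated.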