Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier

TL;DR

Soup-of-Experts delivers fast-instantiated specialist models by learning a linear combination of a large bank of expert parameters conditioned on pretraining-domain weights. It trains a base model with multiple experts and an MLP router to produce domain-specific coefficients, optimizing the expected loss over a meta-distribution of domain mixtures. Empirically, a 110M base with 128 dense experts achieves strong specialization across 16 domains with competitive general performance and favorable deployment cost, outperforming domain-specific training approaches in scalability. The approach offers practical benefits for shipping many specialist models under tight parameter budgets, with smooth extension to low-rank variants and complementary gains to fine-tuning.

Abstract

Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.
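
The instantiation step described above can be illustrated with a minimal NumPy sketch. It assumes the instantiated parameters take the form of shared parameters plus a router-weighted sum of experts, and it flattens every parameter set into a single vector. The function names, the one-hidden-layer ReLU router, and the initialization scales are illustrative assumptions, not details from the paper.

```python
import numpy as np

def init_soup(n_domains, n_experts, n_params, hidden=16, seed=0):
    """Create a toy Soup-of-Experts: shared params S, expert bank E, router MLP."""
    rng = np.random.default_rng(seed)
    return {
        "S": rng.normal(size=n_params),                      # shared parameters
        "E": rng.normal(size=(n_experts, n_params)) * 0.01,  # bank of expert parameters
        "W1": rng.normal(size=(n_domains, hidden)) * 0.1,    # router layer 1 (assumed shape)
        "b1": np.zeros(hidden),
        "W2": rng.normal(size=(hidden, n_experts)) * 0.1,    # router layer 2 (assumed shape)
        "b2": np.zeros(n_experts),
    }

def router(soup, h):
    """Map domain weights h (length n_domains) to expert coefficients alpha."""
    z = np.maximum(h @ soup["W1"] + soup["b1"], 0.0)  # ReLU hidden layer
    return z @ soup["W2"] + soup["b2"]

def instantiate(soup, h):
    """Combine the experts linearly: theta = S + sum_i alpha_i * E_i (assumed form)."""
    alpha = router(soup, h)
    return soup["S"] + alpha @ soup["E"]

# Usage: instantiate a model for one choice of domain weights.
soup = init_soup(n_domains=4, n_experts=8, n_params=32)
h = np.array([0.7, 0.1, 0.1, 0.1])
theta = instantiate(soup, h)
```

In this sketch the instantiated `theta` has the size of a single model no matter how many experts are in the bank, which matches the deployment argument in the abstract: only the combined model is shipped.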

Paper Structure

This paper contains 18 sections, 8 equations, 12 figures, 3 tables, 3 algorithms.

Figures (12)

  • Figure 1: The Soup-of-Experts and its training pipeline. The Soup-of-Experts consists of shared parameters $S$, $n$ expert parameters $E_1, \dots, E_n$, and an MLP that acts as a routing mechanism. At each optimization step, we sample domain weights $h$ from a meta-distribution $\pi$. These domain weights serve two purposes: they are passed through the MLP to give a vector of coefficients $\alpha$ that instantiates a model by combining the experts' weights, and they are used to sample a mini-batch of data distributed according to the domain weights. We then backpropagate through the corresponding loss to update the parameters of the Soup-of-Experts.
  • Figure 2: Data mixture sampling. Given several pretraining domains $D_1,\dots,D_k$ and domain weights $h_1,\dots, h_k$, we can train a model on the mixture $\mathrm{mix}(h) = \sum_{i=1}^k h_iD_i$, using the sampling procedure described in \ref{alg:sampling}. Domain weights have a critical impact on the downstream performance.
  • Figure 3: Quickly instantiating a small model from a pre-trained Soup-of-Experts. Given a specialist dataset with a few samples, we compute the domain weights using \ref{alg:histogram}. The domain weights are then passed through the Soup-of-Experts' MLP to get the coefficients $\alpha$ that are then used to merge the experts. This process is quick since the MLP is small, and it requires no training.
  • Figure 4: Training curves of the different methods. The average specialized loss is the average of the loss of the models over $16$ domains from the Pile. The generic loss is the loss of the models on the standard pre-training distribution of RedPajama-V2. The x-axis is the training time. This number is roughly proportional to the number of tokens processed, since in this setting the cost of instantiating the Soup-of-Experts is small compared to that of backpropagating through the network. The Domain Experts and CRISP have to train many models, so they are not competitive in this setup. The Soup-of-Experts performs nearly as well as generic pre-training on the generic loss, which means that it retains the general knowledge in the pre-training set, while CRISP and Domain Experts are not good generalists (Domain Experts even fall outside the limits of the right figure). The Soup-of-Experts gives the best specialists, as seen on the left figure.
  • Figure 5: The gains of Soup-of-Experts during pretraining are maintained during fine-tuning and sometimes lead to large savings. On each of the 16 domains from the Pile, we fine-tune the corresponding instantiated Soup-of-Experts and generic model, with a limited number of fine-tuning tokens. We stop fine-tuning at the point where validation loss starts increasing. Left: Average loss over domains. We see that the Soup-of-Experts maintains its advantage regardless of the number of available fine-tuning tokens. Right: The number of tokens needed to fine-tune the generic model until it reaches the same validation loss as the base, not fine-tuned, Soup-of-Experts. For example, on uspto, one needs 10M tokens to fine-tune the generic model and reach the same loss as the Soup-of-Experts instantiated on uspto out of the box after pre-training.
  • ...and 7 more figures
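
The data side of the pipeline in Figures 1 and 2 can be sketched as follows: draw domain weights $h$ from a meta-distribution, then build a mini-batch by first picking a domain index with probability $h_i$ and then drawing an example from that domain. The Dirichlet meta-distribution, the uniform within-domain sampling, and all function names here are assumptions made for illustration; the captions above do not specify $\pi$ or the details of the sampling algorithm.

```python
import numpy as np

def sample_domain_weights(k, rng, concentration=1.0):
    """Draw domain weights h on the k-simplex (Dirichlet is an assumed choice of pi)."""
    return rng.dirichlet(np.full(k, concentration))

def sample_batch(domains, h, batch_size, rng):
    """Sample from mix(h): pick domain i with probability h[i], then an example from D_i."""
    idx = rng.choice(len(domains), size=batch_size, p=h)
    # Uniform sampling within the chosen domain (an assumption of this sketch).
    return [domains[i][rng.integers(len(domains[i]))] for i in idx]

# Usage: three toy domains, one sampled mixture, one mini-batch.
rng = np.random.default_rng(0)
domains = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
h = sample_domain_weights(len(domains), rng)
batch = sample_batch(domains, h, batch_size=4, rng=rng)
```

In a training loop, the same `h` would also be fed to the router to instantiate the model whose loss is computed on `batch`, so that each step optimizes the model for the mixture it is evaluated on.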