Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
TL;DR
Soup-of-Experts instantiates specialist models quickly by learning a linear combination of a large bank of expert parameters, conditioned on pretraining-domain weights. It trains a base model together with multiple experts and an MLP router that maps domain weights to combination coefficients, optimizing the expected loss over a meta-distribution of domain mixtures. Empirically, a 110M-parameter base model with 128 dense experts specializes well across 16 domains while keeping competitive general performance and a favorable deployment cost, and it scales better than domain-specific training approaches. The approach is practical for shipping many specialist models under tight parameter budgets, extends to low-rank experts, and yields gains that are complementary to fine-tuning.
Abstract
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performance. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backpropagate through one batch of data sampled with these domain weights. We demonstrate how our approach quickly obtains small specialized models on several language modeling tasks. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.
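The instantiation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-layer ReLU router, the toy sizes, the additive base parameters, and the uniform sampling of domain weights on the simplex are all assumptions made for illustration. Parameters are treated as flat lists of floats.

```python
import random

def sample_domain_weights(num_domains, rng):
    """Draw domain weights uniformly from the probability simplex.

    Assumption: a uniform (Dirichlet(1, ..., 1)) meta-distribution, obtained by
    normalizing independent exponential draws. The actual sampling law may differ.
    """
    draws = [rng.expovariate(1.0) for _ in range(num_domains)]
    total = sum(draws)
    return [d / total for d in draws]

def mlp_coefficients(h, w1, b1, w2, b2):
    """Hypothetical two-layer ReLU MLP: domain weights -> one coefficient per expert."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * x for w, x in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def instantiate(base, experts, alphas):
    """Return base + sum_i alphas[i] * experts[i], element-wise."""
    theta = list(base)
    for alpha, expert in zip(alphas, experts):
        for j, value in enumerate(expert):
            theta[j] += alpha * value
    return theta

# Toy sizes, chosen only to keep the example small.
rng = random.Random(0)
num_domains, num_experts, hidden_dim, num_params = 4, 3, 8, 5

base = [rng.gauss(0, 1) for _ in range(num_params)]
experts = [[rng.gauss(0, 0.1) for _ in range(num_params)]
           for _ in range(num_experts)]
w1 = [[rng.gauss(0, 1) for _ in range(num_domains)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_experts)]
b2 = [0.0] * num_experts

h = sample_domain_weights(num_domains, rng)    # random domain mixture
alphas = mlp_coefficients(h, w1, b1, w2, b2)   # coefficients conditioned on h
theta = instantiate(base, experts, alphas)     # parameters of one small model
```

In a training loop, `theta` would be loaded into the model, a batch would be sampled according to `h`, and gradients would flow back to the base parameters, the experts, and the router; that step needs an automatic-differentiation framework and is omitted here. The instantiated model has only `num_params` parameters regardless of the number of experts, which is why deployment cost stays that of a small model.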
