No Need to Talk: Asynchronous Mixture of Language Models
Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
TL;DR
No Need to Talk proposes SmallTalk LM, an almost asynchronous mixture of language models in which a lightweight router assigns each input sequence to a single expert, so the experts can be trained independently with minimal inter-node communication. By routing on a short prefix and keeping the router much smaller than the experts, the approach reaches lower perplexity than dense baselines at the same training FLOPs and a similar inference cost. Experiments on RedPajama-V2 show perplexity improvements of up to about 18% and downstream gains over the dense baseline on 75% of tasks. The work points to practical benefits of sparse, router-guided mixture training for real-world LLM deployment and to a path toward scaling to hundreds of experts with efficient routing.
Abstract
We introduce SMALLTALK LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Unlike prior works on asynchronous LLM training, our routing method does not rely on full corpus clustering or access to metadata, making it more suitable for real-world applications. Our experiments on language modeling demonstrate that SMALLTALK LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on 75% of the tasks.
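To make the inference scheme concrete, the following is a minimal sketch of prefix-based routing to a single expert. It is an illustration under stated assumptions, not the authors' implementation: it assumes one small scoring model per expert and picks the expert whose scorer assigns the highest log-likelihood to the prefix. The smoothed unigram scorers, the toy corpora, and the names `fit_router`, `prefix_score`, and `route` are all hypothetical stand-ins.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def fit_router(corpus, vocab, alpha=1.0):
    """Fit add-alpha smoothed unigram log-probabilities.

    This is a toy stand-in for a small router model; the real router
    is not specified here and is assumed only for illustration.
    """
    counts = Counter(
        tok if tok in vocab else UNK for doc in corpus for tok in doc
    )
    support = set(vocab) | {UNK}
    total = sum(counts.values()) + alpha * len(support)
    return {tok: math.log((counts[tok] + alpha) / total) for tok in support}


def prefix_score(router, tokens, prefix_len):
    """Log-likelihood of the first `prefix_len` tokens under one router."""
    return sum(router.get(tok, router[UNK]) for tok in tokens[:prefix_len])


def route(routers, tokens, prefix_len=3):
    """Return the index of the single expert chosen for this sequence."""
    scores = [prefix_score(r, tokens, prefix_len) for r in routers]
    return max(range(len(scores)), key=scores.__getitem__)


# Two toy "domains", each of which would be served by its own expert.
code_docs = [
    ["def", "f", "(", "x", ")", ":", "return", "x"],
    ["import", "os", "def", "g", "(", ")", ":"],
]
prose_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ran", "to", "the", "park"],
]
vocab = {tok for doc in code_docs + prose_docs for tok in doc}
routers = [fit_router(code_docs, vocab), fit_router(prose_docs, vocab)]

# Only the first three tokens are inspected; the rest of the sequence
# would be processed by the selected expert alone.
prose_expert = route(routers, ["the", "cat", "ran", "home", "quickly"])
code_expert = route(routers, ["def", "h", "(", "y", ")", ":"])
```

In this sketch only the selected expert would run on the full sequence, which is why the per-sequence inference cost stays close to that of a single model as long as the routers are much cheaper than the experts.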
