No Need to Talk: Asynchronous Mixture of Language Models

Anastasiia Filippova; Angelos Katharopoulos; David Grangier; Ronan Collobert

TL;DR

No Need to Talk proposes SmallTalk LM, an almost asynchronous mixture of language models that uses a lightweight router to assign each input sequence to a single expert, enabling independent training of experts with minimal inter-node communication. By routing with short prefixes and keeping the router much smaller than the experts, the approach achieves lower perplexity than dense baselines for the same training FLOPs and similar inference costs. Empirical results on RedPajama-V2 show perplexity improvements up to about 18% and robust downstream gains on 75% of tasks, with minimal communication overhead and flexible routing. The work demonstrates practical benefits of sparse, router-guided MoE training for real-world LLM deployment and opens paths to scaling to hundreds of experts with efficient routing.

Abstract

We introduce SMALLTALK LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Unlike prior works on asynchronous LLM training, our routing method does not rely on full corpus clustering or access to metadata, making it more suitable for real-world applications. Our experiments on language modeling demonstrate that SMALLTALK LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on 75% of the tasks.
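The prefix-based routing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each expert is paired with a small scoring model, and that a sequence is sent to the expert whose scorer assigns the highest log-likelihood to the first `prefix_len` tokens. The names `Scorer`, `make_unigram_scorer`, and `route` are hypothetical, and the unigram scorers are toy stand-ins for small router language models.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a scorer maps a token sequence to its total log-likelihood.
Scorer = Callable[[Sequence[str]], float]


def make_unigram_scorer(probs: Dict[str, float], floor: float = 1e-6) -> Scorer:
    """Toy stand-in for a small router LM (assumption): tokens are scored
    independently, and unseen tokens get a small floor probability."""
    def score(tokens: Sequence[str]) -> float:
        return sum(math.log(probs.get(tok, floor)) for tok in tokens)
    return score


def route(tokens: Sequence[str], routers: List[Scorer], prefix_len: int) -> int:
    """Return the index of the expert whose router scores the prefix highest.

    Only the first `prefix_len` tokens are scored, so the routing cost does
    not grow with the length of the full sequence.
    """
    prefix = tokens[:prefix_len]
    scores = [router(prefix) for router in routers]
    return max(range(len(routers)), key=scores.__getitem__)


# Two toy routers: one tuned to code-like tokens, one to prose-like tokens.
code_router = make_unigram_scorer({"def": 0.3, "(": 0.3, ")": 0.3})
prose_router = make_unigram_scorer({"the": 0.4, "cat": 0.2, "sat": 0.2})
routers = [code_router, prose_router]

code_like = ["def", "(", ")", "the"]
prose_like = ["the", "cat", "sat", "def"]
print(route(code_like, routers, prefix_len=3))   # expert 0
print(route(prose_like, routers, prefix_len=3))  # expert 1
```

Because exactly one expert is selected per sequence, only that expert's parameters are used to process it, which is why the mixture's inference cost stays close to that of a single dense model of the same size.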

Paper Structure (40 sections, 20 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Method
Background
Language modeling
Mixture of Experts
SmallTalk LM
Routing With Independent Language Models
Deriving a Practical Model
Training Procedure
Balancing the Assignments to Experts
Computational and Communication Cost
Experiments
Experimental Setup
Comparison to the Dense Model
Language Modeling Results
...and 25 more sections

Figures (6)

Figure 1: Balanced assignments. In this scenario we have $3$ experts (columns with different colors) and $3$ sequences to assign (rows), under the constraint that each expert should get $1$ sequence. In (a) we assign each row sequentially by negative log-likelihood, which leads to a sub-optimal assignment because the first expert is already full when we try to assign the last row. In (b), by contrast, we first sort the rows with respect to their minimum negative log-likelihood, which results in optimal assignments.
Figure 2: Better perplexity for the same price. Test perplexity comparison between our approach and the dense baseline, as a function of training cost measured in PFLOPs. In (a), we report results for models with $335$M parameters using $4$, $8$, $16$, and $32$ experts, and in (b) for models with $1.3$B parameters using $4$, $16$, and $32$ experts. In addition, (c) shows the perplexity comparison between our approach and the dense baseline, plotted against the cumulative number of tokens processed throughout training for the $1.3$B parameter models. We observe that our method significantly outperforms the baseline across all experimental configurations. Notably, our $335$M parameter model with $32$ experts achieves a perplexity of $9.07$, outperforming the $1.3$B dense baseline's perplexity of $9.1$. This improvement is achieved with a training budget of $2.5 \times 10^{21}$ FLOPs, which is comparable to the baseline's $2.2 \times 10^{21}$ FLOPs, while requiring roughly one third of the inference compute ($0.87 \times 10^{12}$ FLOPs compared to $2.81 \times 10^{12}$ FLOPs). See the language modeling experiments section and the corresponding appendix for a detailed description of our experimental setup.
Figure 3: Downstream evaluation. Accuracy with respect to perplexity on (a) ARC Challenge, (b) ARC Easy, (c) HellaSwag and (d) MMLU, for $1.3$B parameter dense baselines trained on $266$B, $1$T and $2$T tokens (empty symbols) and mixture models with $1.3$B parameter experts and $4$, $16$ and $32$ experts respectively (filled symbols). The models that have the same symbol shape have near identical training and inference FLOPs.
Figure 4: Routing analysis. (a) Test perplexity over training steps for different router sizes using a $335$M parameter model with $4$ experts. We compare routers of sizes $335$M (where the model routes data for itself), $110$M, $65$M, and $4.4$M parameters. (b) Test perplexity as a function of routing prefix length during inference for the $1.3$B parameter model with $4$, $16$ and $32$ experts. We examine how reducing the prefix length $\hat{M}$ used during inference affects performance when the data is partitioned during training using a prefix size $M \geq \hat{M}$. (c) Test perplexity over training steps for a $335$M parameter model with $16$ experts, comparing our proposed routing with routing based on TF-IDF document encoding followed by SVD projection and balanced K-Means clustering.
Figure 5: Experts do specialize. Test perplexity comparison between our method and the dense baseline on the routed dataset segments for the $1.3$B parameter model, using mixtures of $4$ experts in (a) (trained on $266$B tokens), $16$ experts in (b) ($1$T tokens), and $32$ experts in (c) ($2$T tokens). Each bar represents a dataset segment, with the color intensity indicating the percentage of data assigned to that expert: darker shades correspond to a higher proportion of data. Overlapping bars depict the perplexity achieved by the dense baseline (translucent) and our proposed mixture model (opaque). The results demonstrate that all experts specialize effectively on their assigned segments of the data distribution, leading to consistent improvements over the baseline. Although the data is not distributed perfectly evenly among experts, every expert receives a substantial portion of it. This shows that each expert contributes meaningfully to the overall performance gains.
...and 1 more figures
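The balanced assignment contrast in the Figure 1 caption can be illustrated with a small sketch. It is not the paper's algorithm verbatim: it assumes a greedy scheme in which each sequence takes the lowest-NLL expert that still has capacity, and it interprets "sort by minimum" as processing sequences in ascending order of their best (lowest) negative log-likelihood. The function names and the toy cost matrix are hypothetical.

```python
from typing import List, Optional, Sequence


def greedy_assign(
    nll: Sequence[Sequence[float]],
    capacity: int,
    order: Optional[Sequence[int]] = None,
) -> List[int]:
    """Assign each row (sequence) to the lowest-NLL expert with free capacity.

    nll[i][e] is the negative log-likelihood of sequence i under expert e.
    Rows are processed in `order` (default: their original order).
    """
    n_rows, n_experts = len(nll), len(nll[0])
    if capacity * n_experts < n_rows:
        raise ValueError("total capacity is too small for the number of rows")
    load = [0] * n_experts
    assignment = [-1] * n_rows
    for i in (order if order is not None else range(n_rows)):
        free = [e for e in range(n_experts) if load[e] < capacity]
        best = min(free, key=lambda e: nll[i][e])
        assignment[i] = best
        load[best] += 1
    return assignment


def order_by_min_nll(nll: Sequence[Sequence[float]]) -> List[int]:
    """Row indices sorted by each row's best (lowest) NLL, ascending (assumption)."""
    return sorted(range(len(nll)), key=lambda i: min(nll[i]))


def total_cost(nll: Sequence[Sequence[float]], assignment: Sequence[int]) -> float:
    return sum(nll[i][e] for i, e in enumerate(assignment))


# Toy example: 3 sequences, 3 experts, each expert takes exactly 1 sequence.
nll = [
    [1.0, 1.2, 4.0],  # mildly prefers expert 0
    [3.0, 1.0, 1.5],  # prefers expert 1
    [0.5, 9.0, 9.0],  # strongly prefers expert 0
]

sequential = greedy_assign(nll, capacity=1)
balanced = greedy_assign(nll, capacity=1, order=order_by_min_nll(nll))
print(sequential)  # [0, 1, 2]: the last row is forced onto a poor expert
print(balanced)    # [1, 2, 0]: the strongly opinionated row is served first
print(total_cost(nll, sequential) > total_cost(nll, balanced))  # True
```

In this toy matrix the sorted variant matches the best of all six possible one-to-one assignments, but a greedy pass is not guaranteed to be optimal in general.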
