Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis

TL;DR

Lory introduces a fully differentiable MoE architecture for autoregressive language model pre-training by merging expert parameters and using causal segment routing to preserve autoregressive generation. It pairs this with a similarity-based data batching strategy to drive domain-level expert specialization, enabling training at scale (up to 150B tokens and 32 experts). Empirically, Lory achieves substantial improvements over parameter-matched dense models in perplexity and downstream tasks, while remaining competitive with token-level MoE models. The work demonstrates that fully differentiable MoE can be effective for language modeling, reveals clear domain specialization in learned experts, and lays the groundwork for scalable, efficient future research.

Abstract

Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
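
The two ideas named in the abstract, merging experts in parameter space and routing causally at the segment level, can be illustrated with a minimal pure-Python sketch. It rests on several simplifying assumptions that are not taken from the text above: each expert is a single weight matrix rather than a full feed-forward block, the router is one linear map applied to the mean hidden state of the previous segment, and the first segment (which has no predecessor) simply routes on itself. The function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def merge_experts(experts, weights):
    """Weighted average of expert weight matrices (merging in parameter space)."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(w * e[r][c] for w, e in zip(weights, experts))
             for c in range(cols)] for r in range(rows)]

def causal_segment_moe(hidden, experts, router, segment_len):
    """hidden: list of token vectors. Returns (outputs, per-segment routing weights)."""
    segments = [hidden[i:i + segment_len] for i in range(0, len(hidden), segment_len)]
    outputs, all_weights = [], []
    for k, segment in enumerate(segments):
        # Route on the previous segment, so tokens in segment k never
        # influence the expert mixture that processes them.
        # Assumption of this sketch: the first segment routes on itself.
        source = segments[k - 1] if k > 0 else segment
        weights = softmax(matvec(router, mean_vector(source)))
        merged = merge_experts(experts, weights)  # one merge per segment, not per token
        outputs.extend(matvec(merged, tok) for tok in segment)
        all_weights.append(weights)
    return outputs, all_weights

# Toy setup: two experts (identity and 2x identity), hidden size 2, two segments.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
router = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
outputs, weights = causal_segment_moe(hidden, experts, router, segment_len=2)
```

Because the merged matrix is a softmax-weighted sum, gradients would flow to every expert and to the router without any discrete selection step, and merging once per segment keeps the cost of the merge well below a per-token merge.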

Paper Structure (48 sections, 3 equations, 11 figures, 9 tables, 1 algorithm)

Figures (11)

  • Figure 1: We propose Lory, a fully differentiable MoE architecture designed for autoregressive language models based on expert merging (Section \ref{sec:expert_merging}). We introduce two key techniques to train Lory: First, we propose the causal segment routing strategy, which conducts expert merging at the segment level and preserves the autoregressive property of language models. Second, we use the similarity-based data batching method to construct training instances, which steers the experts toward specializing in specific domains or topics.
  • Figure 2: Left: training curves (log perplexity) of models with different sizes and numbers of experts. Right: perplexity of trained models on different evaluation sets (arXiv, Books, Wikipedia, C4, and Python). We include the detailed model configurations and sizes in Appendix \ref{app:model_config}.
  • Figure 3: Training curves of causal segment routing and prefix routing. The latter is a straightforward segment-level routing strategy that uses the first segment to route the entire input.
  • Figure 4: Left: Training curves with similarity-based data batching (sim batch) or standard random batching (rand batch). Right: Training loss difference between Lory and a dense model when using different batching strategies. Lory leads to a larger loss improvement over the dense model when using similarity-based data batching.
  • Figure 5: Comparison with the state-of-the-art MoE training technique Expert Choice (EC) with segment-level or token-level routing. For both EC models, we use a capacity factor of $1$ and the same amount of FLOPs as our training method for a fair comparison.
  • ...and 6 more figures
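
The similarity-based data batching mentioned in the abstract and in Figures 1 and 4 groups similar documents into the same training instance. One way this could look is sketched below. The specifics are assumptions for illustration only: document embeddings are taken as given, similarity is cosine similarity, documents are ordered by a greedy nearest-neighbor chain, and the helper names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_order(embeddings):
    """Greedy chain: start at document 0, then hop to the most similar unvisited one."""
    if not embeddings:
        return []
    order = [0]
    remaining = set(range(1, len(embeddings)))
    while remaining:
        last = embeddings[order[-1]]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nxt = max(sorted(remaining), key=lambda j: cosine(last, embeddings[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def build_instances(docs, embeddings, instance_len):
    """Concatenate tokenized docs in similarity order and cut fixed-length instances.

    A trailing chunk shorter than instance_len is dropped in this sketch.
    """
    stream = [tok for i in similarity_order(embeddings) for tok in docs[i]]
    return [stream[i:i + instance_len]
            for i in range(0, len(stream) - instance_len + 1, instance_len)]

# Toy corpus: two "code-like" and two "wiki-like" documents, interleaved.
docs = [["py1", "py2"], ["wiki1", "wiki2"], ["py3", "py4"], ["wiki3", "wiki4"]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
order = similarity_order(embeddings)
instances = build_instances(docs, embeddings, instance_len=4)
```

In this toy run, consecutive segments of an instance come from related documents, so a router that reads one segment gets a useful signal about which experts suit the next. That is the intuition behind pairing this batching with segment-level routing.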