Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun

TL;DR

In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, Expert Threshold (ET) routing achieves 0.067 lower cross-entropy loss than token-choice MoE (TC-MoE), equivalent to reaching the same performance with 1.6$\times$ fewer tokens.

Abstract

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6$\times$ fewer tokens.
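
The routing rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the router scores here are random numbers standing in for a learned router, and the function names (`route_tokens`, `batch_cutoff`, `update_thresholds`), the EMA decay of 0.9, and the batch sizes are all assumptions chosen for illustration. Each expert tracks an EMA of the batch score that separates its top-$(1/E)$ fraction of tokens, and a token is routed to every expert whose threshold its score exceeds.

```python
import random

def route_tokens(scores, thresholds):
    """scores[t][e] is token t's score for expert e (hypothetical layout).

    Each token is routed on its own: it goes to every expert whose
    threshold its score exceeds, so it may get zero, one, or many experts.
    """
    return [
        [e for e, (s, c) in enumerate(zip(row, thresholds)) if s > c]
        for row in scores
    ]

def batch_cutoff(expert_scores, num_experts):
    """Score at the top-(1/E) position of one expert's scores in a batch."""
    k = max(1, len(expert_scores) // num_experts)
    return sorted(expert_scores, reverse=True)[k - 1]

def update_thresholds(thresholds, scores, decay=0.9):
    """Move each expert's threshold toward the current batch cutoff (EMA).

    The decay value is an assumption, not taken from the paper.
    """
    num_experts = len(thresholds)
    return [
        decay * c
        + (1.0 - decay) * batch_cutoff([row[e] for row in scores], num_experts)
        for e, c in enumerate(thresholds)
    ]

def random_scores(rng, num_tokens, num_experts):
    """Stand-in for a learned router: uniform random scores in [0, 1)."""
    return [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]

rng = random.Random(0)
num_experts = 4
thresholds = [0.0] * num_experts

# "Training": only the thresholds are updated from batch statistics.
for _ in range(200):
    thresholds = update_thresholds(thresholds, random_scores(rng, 256, num_experts))

# "Inference": routing uses the frozen thresholds, one token at a time.
eval_scores = random_scores(rng, 1024, num_experts)
assignments = route_tokens(eval_scores, thresholds)
load = [sum(e in a for a in assignments) / len(assignments) for e in range(num_experts)]

print("thresholds:", [round(c, 3) for c in thresholds])
print("per-expert load fraction:", [round(f, 3) for f in load])
print("experts per token (first 8):", [len(a) for a in assignments[:8]])
```

In this toy setting each expert's load settles near the $1/E$ target without any auxiliary loss, while the number of experts per token varies. Because `route_tokens` compares a token only against stored thresholds, routing one token alone gives the same result as routing it inside a batch.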

Paper Structure (64 sections, 2 theorems, 20 equations, 25 figures, 8 tables, 2 algorithms)

Key Result

Theorem 1.2

If the cutoff threshold is represented with $b$ bits of precision (e.g., $b=16$ for bf16 or $b=32$ for fp32), then the future information leakage satisfies $\mathcal{L}_F([1\!:\!N]) \le b$ for all $N$.
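
One intuition for why a bound of this form is plausible is a simple counting argument. This is a sketch and not the paper's proof: it assumes that leakage is measured in bits of information about future tokens, and that the stored cutoff $c$ is the only channel through which future tokens can influence the routing decisions $R$ for the current tokens. Definition 1.1 is not reproduced here, so both assumptions, along with the symbols $F$ (future tokens) and $R$, are illustrative.

$$
I(R; F) \;\le\; I(c; F) \;\le\; H(c) \;\le\; \log_2 2^{b} \;=\; b .
$$

The first inequality is the data-processing inequality under the single-channel assumption, the second holds because mutual information never exceeds entropy, and the third holds because a $b$-bit value takes at most $2^b$ distinct values. The bound does not depend on $N$, which matches the "for all $N$" in the statement.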

Figures (25)

  • Figure 1: Evaluation loss for Dense, TC, and ET. Compared to TC, ET reaches a final loss that is 0.067 lower, equivalent to reaching the same performance level with 1.6$\times$ fewer tokens.
  • Figure 2: Illustration of TC, EC, and ET routing mechanisms and their routing pools. Left: TC routes each token independently to its top-$G$ experts, causing load imbalance. Middle: EC has each expert select its top-$k$ tokens from the batch, requiring access to all tokens including future ones (non-causal). Right: ET routes each token independently by comparing its score against the population's top-$(1/E)$ quantile estimated by an EMA-tracked threshold $c_i$, enabling fully causal routing over the population.
  • Figure 3: Cutoff stability vs. expert usage tradeoff. Top: Signed cutoff deviation relative to the EMA for EC at 512k batch size. ET stays at zero because routing uses the cutoff EMA directly. Bottom: Expert usage for EC at 512k and ET. ET varies around the capacity target while EC remains constant.
  • Figure 4: Per-token expert routing on a GSM8K passage.
  • Figure 5: Expert activation heatmap. Top: EC with batch size 2k shows less specialization. Bottom: ET shows more extreme activation patterns, suggesting more domain-aware routing.
  • ...and 20 more figures

Theorems & Definitions (5)

  • Definition 1.1: Future information leakage
  • Theorem 1.2: Finite-precision cutoff implies constant leakage
  • proof
  • Theorem 1.3
  • proof