MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

TL;DR

This paper introduces hyper-parallel scaling as a new paradigm to boost language model quality by diversifying internal computations at inference time. It operationalizes this idea through RoE, a training-free technique that treats a single MoE as a dynamic ensemble by sampling diverse expert routes per token and aggregating their outputs. The authors propose Gumbel-Top-K routing with layer-specific temperature, paired with efficient batching and a Clean Cache KV strategy to manage compute and memory. Empirical results show RoE improves performance across math, commonsense, and code benchmarks, matching or approaching larger MoE models at substantially lower inference cost. This approach offers a practical path to enhance open-ended generation without model fine-tuning.

Abstract

The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.
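As a rough illustration of the routing-and-aggregation idea described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the stochasticity is Gumbel noise scaled by a temperature and added to the router logits, that gate weights come from a softmax over the unperturbed logits of the selected experts, and that the per-sample output logits are aggregated by a plain mean. The names `gumbel_top_k`, `moe_layer`, and `roe_logits`, along with the toy single-layer model, are hypothetical.

```python
import numpy as np


def gumbel_top_k(router_logits, k, temperature, rng):
    """Return indices of the k largest router logits after adding scaled Gumbel noise.

    Assumption: noise enters as `logits + temperature * Gumbel(0, 1)`;
    temperature = 0 recovers ordinary deterministic top-k routing.
    """
    noise = rng.gumbel(size=router_logits.shape)
    noisy = router_logits + temperature * noise
    return np.argsort(noisy)[-k:][::-1]  # best expert first


def moe_layer(x, router_w, expert_ws, k, temperature, rng):
    """Toy MoE layer: route x to k experts and mix their outputs."""
    router_logits = router_w @ x                      # shape: (num_experts,)
    chosen = gumbel_top_k(router_logits, k, temperature, rng)
    # Assumption: gates are a softmax over the clean logits of the chosen experts.
    gate = np.exp(router_logits[chosen] - router_logits[chosen].max())
    gate /= gate.sum()
    return sum(g * (expert_ws[e] @ x) for g, e in zip(gate, chosen))


def roe_logits(x, router_w, expert_ws, out_w, n, k=2, temperature=0.5, seed=0):
    """Sample n expert routes for one token and average the output logits.

    Assumption: aggregation is an unweighted mean over the n proposals.
    """
    rng = np.random.default_rng(seed)
    proposals = [
        out_w @ moe_layer(x, router_w, expert_ws, k, temperature, rng)
        for _ in range(n)
    ]
    return np.mean(proposals, axis=0)


# Toy usage: hidden size 4, 4 experts, vocabulary of 5 tokens.
init = np.random.default_rng(42)
x = init.normal(size=4)
router_w = init.normal(size=(4, 4))
expert_ws = init.normal(size=(4, 4, 4))
out_w = init.normal(size=(5, 4))
token_logits = roe_logits(x, router_w, expert_ws, out_w, n=8)
```

In this sketch every sample reruns the layer from scratch. The batching and KV-caching strategies mentioned in the abstract are what would make this affordable in a real model, and they are not modeled here.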

Paper Structure

This paper contains 24 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: A categorization of inference-time scaling strategies. (I) Sequential Scaling: Enhancing performance by generating longer, structured outputs like a chain of thought (Wei et al., 2022). (II) Parallel Scaling: Generating multiple token sequences and aggregating them, as in Self-Consistency (Wang et al., 2022). (III) Hyper-Parallel Scaling: A novel paradigm, instantiated by RoE, that aggregates results from diverse internal computation paths on a per-token basis.
  • Figure 2: An illustration of the Roster of Experts (RoE) method. Left: For a single input, $n$ distinct experts are sampled by adding stochasticity to the expert routing at each MoE layer, and the resulting output logits are aggregated to form the final prediction. Right: A closer view of a single MoE layer shows $k=2$ active experts (dark orange), where Gumbel noise (dark blue) is added to the router logits, and the top-$k$ experts are selected based on these modified logits.
  • Figure 3: Performance comparison of base MoE models and RoE on five mathematical, five commonsense, and two code benchmarks. Accuracy is measured by exact match (except for HE and HE+, which use pass@1). Results are averaged over five random seeds. Axes are normalized to $(\text{min}-1, \text{max}+1)$ for visualization.
  • Figure 4: Impact of caching on RoE performance. (a) Resource usage with caching enabled. Peak memory (blue, left axis) and power per token (orange, right axis) show a modest increase as the sample count grows. (b) Latency comparison with and without caching. Without caching, latency per token rises exponentially, highlighting the necessity of caching for scalable RoE inference.
  • Figure 5: Performance and efficiency analysis of RoE. (a) The performance of RoE applied to OLMoE-7B, measured in terms of an equivalent standard MoE model size. Performance is evaluated using perplexity on the WikiText-103 test set. (b) Comparison of the relative increase in latency and memory for RoE (blue) versus scaling up to an equivalently performing MoE model (orange). The numbers on the blue curve indicate the RoE sample size $K$.
  • ...and 5 more figures