
Self-Routing: Parameter-Free Expert Routing from Hidden States

Jama Hussein Mohamud, Drew Wagner, Mirco Ravanelli

Abstract

Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17% higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
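The abstract describes the mechanism in enough detail to sketch it in code. Below is a minimal PyTorch sketch of a parameter-free self-routing MoE layer, assuming the designated subspace is the first num_experts dimensions of the hidden state and that gates are softmax-normalized over the selected top-k logits; these specifics are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a parameter-free "self-routing" MoE layer.
# Assumptions (not from the paper): the designated subspace is the first
# num_experts hidden dimensions, and gates are softmax-normalized over the
# selected top-k logits.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfRoutingMoE(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Experts are ordinary feed-forward blocks; only the routing step changes.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        # Parameter-free routing: read expert logits directly from a slice of
        # the hidden state instead of computing them with a learned projection.
        logits = x[:, : self.num_experts]              # (num_tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(topk_vals, dim=-1)           # renormalize over the chosen experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue
            out[token_pos] += gates[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
        return out
```

Everything outside the routing step (the experts themselves and the top-k dispatch) matches an ordinary learned-router MoE; only the computation of the expert logits differs.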


Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Learned routing vs. Self-Routing. Self-Routing replaces the learned projection with a direct hidden-state readout while leaving the rest of the MoE top-$k$ dispatch unchanged.
  • Figure 2: Per-layer normalized expert-utilization entropy. Self-Routing achieves higher normalized entropy than the learned and fixed random projection routers across most MoE layers. The dashed line marks normalized entropy 1, corresponding to perfectly uniform usage of all 8 experts.
  • Figure 3: Layer-by-expert routing fractions. Each row is a layer and each column is an expert. Hotter colors indicate a larger fraction of routed tokens. Self-Routing exhibits the most even allocation pattern after the earliest layers, while the learned router and fixed random projection show stronger expert concentration and more inactive experts.
  • Figure 4: Maximum expert fraction by layer. For each layer, we plot the fraction of tokens captured by the single most-used expert. Lower values indicate less concentration. Self-Routing is consistently less dominated by a single expert than either the learned router or fixed random projection.
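To make the balance metrics in Figures 2-4 concrete, the sketch below shows one plausible way to compute normalized expert-utilization entropy and the maximum expert fraction from hard top-k assignments at a single layer; whether the paper computes these from hard counts or gate weights is an assumption here.

```python
# Illustrative sketch (the exact metric definitions are assumptions, not taken
# from the paper): normalized entropy of expert usage and the largest
# single-expert fraction, computed from hard top-k assignments at one layer.
import math

import torch


def utilization_stats(topk_idx: torch.Tensor, num_experts: int):
    """topk_idx: (num_tokens, top_k) tensor of expert indices chosen per token."""
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    p = counts / counts.sum()                        # fraction of routed tokens per expert
    entropy = -(p * p.clamp_min(1e-12).log()).sum()  # Shannon entropy in nats
    normalized_entropy = entropy / math.log(num_experts)
    max_fraction = p.max()                           # most-used expert's share of tokens
    return normalized_entropy.item(), max_fraction.item()
```

A normalized entropy of 1 corresponds to perfectly uniform usage of all experts (the dashed line in Figure 2), while a lower maximum fraction means the layer is less dominated by a single expert (Figure 4).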