ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

Anzhe Cheng, Shukai Duan, Shixuan Li, Chenzhong Yin, Mingxi Cheng, Heng Ping, Tamoghna Chattopadhyay, Sophia I Thomopoulos, Shahin Nazarian, Paul Thompson, Paul Bogdan

TL;DR

ERMoE tackles fundamental MoE routing challenges by reparameterizing each expert in a learned orthonormal eigenbasis and routing tokens via an eigenbasis alignment score rather than learned logits. This content-aware routing eliminates the need for auxiliary load-balancing losses, stabilizes expert utilization, and yields interpretable specialization. Across vision benchmarks, CLIP-style cross-modal retrieval, and 3D brain MRI brain-age tasks, ERMoE achieves state-of-the-art or competitive accuracy with more balanced expert usage and clear anatomical or subspace interpretations. The approach scales to 3D volumetric data (ERMoE-ba) and demonstrates practical impact through improved performance, calibration, and model efficiency, suggesting a robust architectural principle for sparse MoEs.

Abstract

Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts but face two core challenges: misalignment between router logits and each expert's internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an "Eigenbasis Score", defined as the cosine similarity between input features and an expert's basis. This content-aware routing ties token assignments directly to experts' representation spaces, stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE removes the need for explicit balancing losses and avoids the interfering gradients they introduce. We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields anatomically interpretable expert specializations. ERMoE thus introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and enables improved performance with scalable, interpretable specialization.
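
The routing idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes each expert's basis is a matrix with orthonormal columns, that the score compares a token with its attention-weighted context after projection into that basis (as the Figure 1 caption describes), and that weights are the kept scores divided by their sum. The function names, the non-negative threshold, and the fallback to the single best expert when nothing passes the threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def orthonormal_basis(rng, d, r):
    """Return a d x r matrix with orthonormal columns (QR of a random matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

def eigenbasis_scores(x, c, bases, eps=1e-8):
    """Cosine similarity between token x and context c, each projected
    onto every expert's basis. Returns one score per expert."""
    scores = []
    for q in bases:
        px, pc = q.T @ x, q.T @ c
        scores.append(px @ pc / (np.linalg.norm(px) * np.linalg.norm(pc) + eps))
    return np.array(scores)

def route(scores, k, threshold):
    """Keep at most k top-scoring experts whose score exceeds the threshold
    and turn the kept scores into weights that sum to 1."""
    if threshold < 0:
        raise ValueError("this sketch assumes a non-negative threshold")
    top = np.argsort(scores)[::-1][:k]
    keep = top[scores[top] > threshold]
    if keep.size == 0:
        # Assumption (not from the paper): fall back to the best expert.
        return top[:1], np.ones(1)
    w = scores[keep]
    return keep, w / w.sum()

# Toy usage: two experts in a 4-d feature space, each spanning two axes.
bases = [np.eye(4)[:, :2], np.eye(4)[:, 2:]]
x = np.array([1.0, 0.0, 0.1, 0.0])   # token features
c = np.array([1.0, 0.1, -0.1, 0.0])  # attention-weighted context (given here)
scores = eigenbasis_scores(x, c, bases)
experts, weights = route(scores, k=2, threshold=0.0)
```

In this toy case the token and its context agree inside the first expert's subspace and disagree inside the second, so only the first expert passes the threshold. No learned gating logits appear anywhere in the sketch; the assignment depends only on the input and the experts' bases.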

Paper Structure

This paper contains 21 sections, 12 equations, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overview of the Eigen-Reparameterized Mixture-of-Experts (ERMoE) framework. a) A ViT backbone tokenizes the image; at each ERMoE block, the router computes an eigenbasis score per expert, selects the top-$k$ experts whose scores exceed a threshold $T$, and aggregates their outputs with score-normalized weights for the classification head. b) Details of the eigenbasis score. For a given expert, the input token and its attention-weighted context are projected into that expert’s eigenbasis; the score is the cosine similarity between the two projections. c) A 3D ViT tokenizes volumetric T1 scans; routing operates over region experts and free experts, and their weighted outputs drive the brain-age estimator.
  • Figure 2: Brain-age (BA) estimation on ADNI test set. Scatter plots show ERMoE-ba predicted BA versus chronological age (CA) for a) males and b) females on the test set. Points are colored by sex (male: blue; female: orange). The solid diagonal denotes the "no error" line ($\mathrm{BA}=\mathrm{CA}$).
  • Figure 3: Expert activation during training (ERMoE). Class–expert routing heatmaps for four ERMoE layers in a ViT-B/16 backbone. Each panel shows the average mixture weight assigned to each expert (x-axis: expert id; y-axis: ImageNet class index $0\!\rightarrow\!1000$). Early layers display broad, overlapping activations; deeper layers sharpen into clearer preferences, indicating healthy specialization without collapse. ERMoE uses thresholded top-$k$ routing with eigenbasis-aligned scores, which curbs noisy assignments and maintains balanced utilization throughout training.
  • Figure 4: Expert balancing on Tiny-ImageNet (test only) and ImageNet. Percentage of tokens routed to each expert after sorting experts by usage (lower is flatter). We compare BASE (test-time greedy), SoftMoE (top-1 of soft combine weights), V-MoE (token-choice with LBL), and ERMoE (ours). ERMoE reduces the long-tail skew observed in V-MoE and the top-1 view of SoftMoE, approaching the near-uniform behavior of BASE without relying on balance-by-construction. Curves are computed on Tiny-ImageNet and ImageNet validation tokens with 128 total experts (16×8).
  • Figure S1: Router selections on region-isolated brain inputs. For each region (WM, GM, CSF), we feed a volume in which only that region is retained and log the experts chosen by the final MoE layer over training. Columns show epochs 1, 5, and 300; each cell lists the top-2 selected experts with their eigenbasis scores.
  • ...and 4 more figures
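
The load curve described in the Figure 4 caption (percentage of tokens routed to each expert, with experts sorted by usage) can be computed along the following lines. This is a minimal sketch under stated assumptions: each token is represented by a single expert index, the sort is in descending order, and the function name `sorted_load_percent` is hypothetical. The caption does not state the sort direction or how multi-expert assignments are counted.

```python
import numpy as np

def sorted_load_percent(assignments, num_experts):
    """Percentage of tokens per expert, sorted from most to least used.

    assignments: 1-D array of expert indices, one per routed token.
    Experts that receive no tokens still appear, with a load of 0.
    """
    counts = np.bincount(assignments, minlength=num_experts)
    pct = 100.0 * counts / counts.sum()
    return np.sort(pct)[::-1]

# Toy usage: six tokens routed across four experts.
assignments = np.array([0, 0, 0, 1, 2, 2])
curve = sorted_load_percent(assignments, num_experts=4)
uniform = 100.0 / 4  # reference value for a perfectly flat curve
```

A flatter curve stays closer to the uniform reference across all experts, whereas a long-tailed curve starts well above it and falls toward zero.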