Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression

Sieun Hyeon, Jaeyoung Do

TL;DR

The authors argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. They propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model's next-token distribution on unlabeled calibration data.

Abstract

Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms - Expert Pruning, Expert Editing, and Expert Merging - and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch when experts are changed but the router is left untouched. We argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. To this end, we propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model's next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained MoEs due to their more complex routing decision boundaries.
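
The distillation objective described above can be illustrated with a minimal sketch. It assumes a forward KL divergence between the original (teacher) and compressed (student) next-token distributions with an optional softmax temperature; the abstract does not state the exact loss form, so the KL direction, the temperature, and the names `log_softmax` and `kd_loss` are illustrative assumptions rather than the paper's implementation. In Router KD, only the router parameters would receive gradients from such a loss while the expert parameters stay frozen.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vector of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions.

    teacher_logits / student_logits: lists of per-position logit vectors
    from the original and the compressed model (hypothetical inputs).
    The KL direction and temperature are assumptions for this sketch.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row, temperature)  # teacher log-probabilities
        log_q = log_softmax(s_row, temperature)  # student log-probabilities
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy check: identical logits give zero loss, mismatched logits a positive one.
same = kd_loss([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
diff = kd_loss([[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]])
```

The loss needs no labels, only the teacher's outputs on calibration text, which matches the abstract's description of distilling on unlabeled calibration data.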

Paper Structure (80 sections, 75 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Illustration of the three MoE compression paradigms. The diagram depicts (A) Expert Pruning (selection), (B) Expert Editing (decomposition or modification), and (C) Expert Merging (aggregation) to reduce model size.
  • Figure 2: Layer-wise Analysis of Router Behavior on Qwen3-30B-A3B-Instruct-2507 (128 Experts). Results are based on 100 samples from the ELI5 dataset. (a) shows the L1 distance of routing probabilities, where Pruning methods exhibit larger divergence than Editing and Merging. (b) illustrates the Top-8 expert overlap ratio, indicating that router consistency degrades in deeper layers across all compression methods.
  • Figure 3: Performance Impact of Router KD on Qwen3 vs. Mixtral. The chart compares performance recovery across Pruning, Editing, and Merging, where green bars indicate improved benchmarks. Router KD proves significantly more effective for the fine-grained Qwen3 (left) than for the coarse-grained Mixtral (right), which shows limited gains due to its simpler routing decision boundaries.
  • Figure 4: Robustness of Router KD under Milder Compression (75% Retention). Average benchmark scores for Qwen3 when retaining 75% of Expert parameters. Comparing standard compression against Router KD (suffix -R) relative to the original model's mean (red line) confirms that Router KD consistently recovers performance across different compression ratios.
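
The two router-behavior metrics named in the Figure 2 caption (L1 distance of routing probabilities and Top-k expert overlap ratio) can be sketched for a single token as follows. The captions do not give the exact definitions, so dividing the overlap by k and the function names `l1_distance` and `top_k_overlap` are assumptions made for illustration; the per-layer values plotted in the paper would presumably be averages over tokens and samples.

```python
def l1_distance(p, q):
    """L1 distance between two routing probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def top_k_overlap(p, q, k):
    """Fraction of the top-k experts under p that are also top-k under q.

    Normalizing by k is an assumption; the source only names the metric.
    """
    top_p = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    top_q = set(sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k])
    return len(top_p & top_q) / k


# Toy routing distributions over 4 experts for one token (made-up numbers).
original = [0.5, 0.3, 0.1, 0.1]    # router output before compression
compressed = [0.1, 0.3, 0.5, 0.1]  # router output after compression

distance = l1_distance(original, compressed)       # 0.4 + 0.0 + 0.4 + 0.0
overlap = top_k_overlap(original, compressed, k=2)  # experts {0, 1} vs {1, 2}
```

For the Qwen3 setting in Figure 2, the same overlap computation would use k=8 over 128 experts.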