C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

Zhongyang Li, Ziyue Li, Tianyi Zhou

TL;DR

This paper addresses sub-optimal expert pathways in mixture-of-experts (MoE) LLMs by introducing C3PO, a test-time framework that re-mixes routing weights for each test sample using surrogates built from neighboring reference samples. It presents three surrogate strategies (mode finding, kernel regression, and neighborhood-guided losses) and restricts optimization to core experts in critical layers to reduce compute. Across six benchmarks, C3PO improves accuracy by 7-15%, and MoE variants with 1-3B active parameters outperform larger dense models. The results suggest that sample-specific routing optimization at test time can yield substantial gains without parametric fine-tuning, pointing to a practical path toward more capable and efficient MoE LLMs.

Abstract

Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways: our study reveals that naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods that re-weight, or "re-mix", the experts in different layers jointly for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms only to the core experts' mixing weights in critical layers, which achieves similar performance while saving significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)". We apply C3PO to two recent MoE LLMs and evaluate it on six widely used benchmarks. It consistently improves the base model by 7-15% in accuracy and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, strengthening MoE's efficiency advantage. Our thorough ablation study further offers novel insights into achieving test-time improvement on MoE.
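To make the kernel-regression surrogate more concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It estimates mixing weights for a test sample as a kernel-weighted average of the routing weights of correctly answered reference samples, then blends that estimate with the router's original weights. The Gaussian kernel, the `bandwidth` and `alpha` parameters, the use of plain embedding vectors, and the function names `kernel_regression_weights` and `remix` are all hypothetical choices made for illustration.

```python
import math


def kernel_regression_weights(test_emb, ref_embs, ref_weights, ref_correct,
                              bandwidth=1.0):
    """Kernel-weighted average of routing weights over successful neighbors.

    Assumption: a Gaussian kernel on squared Euclidean distance between
    embeddings; only reference samples flagged as correct contribute.
    """
    num = [0.0] * len(ref_weights[0])
    denom = 0.0
    for emb, weights, ok in zip(ref_embs, ref_weights, ref_correct):
        if not ok:
            continue  # skip reference samples the model answered incorrectly
        sq_dist = sum((a - b) ** 2 for a, b in zip(test_emb, emb))
        k = math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        denom += k
        for j, w in enumerate(weights):
            num[j] += k * w
    if denom == 0.0:
        raise ValueError("no successful neighbors with non-zero kernel weight")
    return [v / denom for v in num]


def remix(original, estimate, alpha=0.5):
    """Blend the router's weights with the neighbor estimate and renormalize.

    Assumption: a fixed interpolation coefficient `alpha` in [0, 1].
    """
    mixed = [alpha * o + (1.0 - alpha) * e for o, e in zip(original, estimate)]
    total = sum(mixed)
    return [m / total for m in mixed]


if __name__ == "__main__":
    ref_embs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    ref_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    ref_correct = [True, True, False]
    est = kernel_regression_weights([0.0, 0.0], ref_embs, ref_weights, ref_correct)
    print(remix([0.5, 0.5], est))
```

In this toy setup the test sample sits on top of the first reference sample, so the estimate leans toward that sample's expert weights; the third reference sample is ignored because it is not marked successful.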

Paper Structure

This paper contains 42 sections, 9 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: Comparison of OLMoE-1B-7B (1B activated parameters) with C3PO against multiple 7B dense models across six benchmarks. C3PO improves OLMoE-1B-7B's accuracy by 7-15%, outperforming the 7B models on all benchmarks, validating the efficiency of the MoE architecture and the effectiveness of C3PO's optimization.
  • Figure 2: Pathway optimization in C3PO. For a test sample, C3PO retrieves successful pathways (green arrows) from similar samples in the reference set and adjusts the initial pathway (red arrow) based on them to achieve better prediction.
  • Figure 3: Analysis of critical layers in OLMoE (F: early layers, M: middle layers, L: late layers). Optimizing only the last five layers (L5) achieves the highest accuracy, outperforming full-layer optimization (All16) and partial combinations (e.g., F2L3).
  • Figure 4: Accuracy-FLOPs trade-off when varying the number of core experts ($n$) of OLMoE optimized by C3PO. Accuracy sees its largest boost at $n=8$ and plateaus at $n=20$, indicating that 8-20 core experts suffice to retain most of the gain from pathway optimization.
  • Figure 5: Average percentage of the top-8 experts (after optimizing all experts) that are retained in the top-$n$ experts identified by the pretrained router in OLMoE. The results indicate that selecting $n\geq20$ in advance effectively covers almost all of the 8 core experts contributing to performance.
  • ...and 3 more figures
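
The retention statistic described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: for each sample, take the top-$k$ experts under the optimized weights and measure what fraction of them also fall within the router's top-$n$ experts, then average over samples. The function names and the toy inputs are hypothetical.

```python
def top_indices(scores, k):
    """Indices of the k largest scores (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def core_expert_coverage(router_scores, optimized_weights, n, k=8):
    """Fraction of the optimized top-k experts found in the router's top-n."""
    kept = top_indices(optimized_weights, k) & top_indices(router_scores, n)
    return len(kept) / k


def mean_coverage(pairs, n, k=8):
    """Average coverage over (router_scores, optimized_weights) pairs."""
    return sum(core_expert_coverage(r, o, n, k) for r, o in pairs) / len(pairs)


if __name__ == "__main__":
    router = [0.4, 0.3, 0.2, 0.1]       # toy pretrained-router scores
    optimized = [0.1, 0.4, 0.3, 0.2]    # toy weights after optimization
    for n in (2, 3, 4):
        print(n, core_expert_coverage(router, optimized, n, k=2))
```

Coverage can only stay level or rise as $n$ grows, which is why a sufficiently large pre-selected set of candidate experts can stand in for optimizing over all of them.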