Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

Wanxing Wu; He Zhu; Yixia Li; Lei Yang; Jiehui Zhao; Hongru Wang; Jian Yang; Benyou Wang; Bingyi Jing; Guanhua Chen

TL;DR

This work introduces RouterXBench, a principled framework for evaluating routers in edge–cloud LLM collaboration across three dimensions: intrinsic router ability, deployment-scenario alignment, and cross-domain robustness. It proposes ProbeDirichlet, a cross-layer hidden-state router that aggregates layer information via a learned Dirichlet distribution and is trained on multi-domain data so that it generalizes to both in-domain (ID) and out-of-distribution (OOD) tasks. The results show substantial improvements over baselines in router ability (AUROC) and in high-accuracy scenario performance (HCR), with strong cross-model generalization and applicability to agent-based inference. The study emphasizes data diversity as a key driver of robustness, offering practical guidance for designing cost-efficient, private, and reliable collaborative LLM systems.

Abstract

Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
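The cross-layer aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a single weight vector over layers, the choice to use the Dirichlet mean at inference time, and the logistic probe on top are all assumptions made for illustration. The abstract only states that hidden states from several layers are combined through learnable Dirichlet distributions.

```python
import math
import random


def dirichlet_weights(alpha, rng=None):
    """Return layer weights from Dirichlet concentrations `alpha` (hypothetical helper).

    With rng=None the Dirichlet mean alpha / sum(alpha) is returned (assumed
    inference-time behaviour). With an rng, one sample is drawn by normalizing
    independent Gamma(alpha_l, 1) draws, which is a standard way to sample a Dirichlet.
    """
    if any(a <= 0 for a in alpha):
        raise ValueError("Dirichlet concentrations must be positive")
    if rng is None:
        draws = list(alpha)
    else:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def aggregate_layers(hidden, weights):
    """Weighted sum of per-layer hidden states; `hidden` is a list of equal-length vectors."""
    dim = len(hidden[0])
    return [sum(w * layer[j] for w, layer in zip(weights, hidden)) for j in range(dim)]


def probe_score(pooled, probe_w, probe_b):
    """Logistic probe (an assumption): estimated probability that the small model answers correctly."""
    logit = sum(w * x for w, x in zip(probe_w, pooled)) + probe_b
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage: three layers, two-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = dirichlet_weights([1.0, 1.0, 2.0])
pooled = aggregate_layers(hidden, weights)
score = probe_score(pooled, [0.0, 0.0], 0.0)
```

In a real system the concentrations and probe parameters would be learned jointly; here they are fixed by hand only to show the data flow.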

Paper Structure (45 sections, 18 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Related Work
LLM Routing.
LLM Collaboration.
LLM Uncertainty Estimation.
Evaluation Framework
Problem Setup
Limitations of Current Metrics
Static Metrics.
Curve-based Metrics.
Triple-Perspective Framework
1. Router Ability.
2. Scenario Alignment.
3. Cross-Domain Robustness.
Methodology
...and 30 more sections

Figures (7)

Figure 1: Left: Cost-performance mapping where $d(\theta)$ represents the call rate at threshold $\theta$ and $\text{Perf}(\theta)$ denotes overall performance. By varying $\theta$, this can be re-parameterized as call rate vs. performance (see the metrics definition section). Right: An illustrative limitation of existing metrics.
Figure 2: Overview of the ProbeDirichlet router and RouterXBench evaluation framework. Router ability is quantified using AUROC, measuring the router's accuracy in predicting whether the SLM can answer correctly. Scenario alignment is evaluated across three call-rate regimes: low band (Low-band Performance Mean, LPM), mid band (Mid-band Performance Mean, MPM), and high band (High-band Call-Rate, HCR).
Figure 3: Effect of probe complexity on performance and generalization. The horizontal line represents the Linear Probe baseline, serving as a constant reference independent of the hidden dimension axis.
Figure 4: Validation AUROC (%) across training scales for single-dataset and mixed-dataset probes. Low/Mid/High denote 1K/4K/8K samples per dataset for single-dataset training, and 3K/12K/24K total samples for mixed training.
Figure 5: Cost-Performance curve under the agent-based inference scenario on HotpotQA.
...and 2 more figures
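The quantities named in the captions of Figures 1 and 2 (call rate $d(\theta)$, overall performance $\text{Perf}(\theta)$, and AUROC for router ability) can be made concrete with a small sketch. The routing rule used here, sending a query to the large model when the router's score falls below $\theta$, is an assumption, as are all function and variable names. The band-based metrics (LPM, MPM, HCR) are omitted because their band boundaries are not given in this summary.

```python
def call_rate_and_perf(scores, slm_correct, llm_correct, theta):
    """Return (d(theta), Perf(theta)) for one threshold.

    scores[i]      : router's estimate that the small model answers query i correctly
    slm_correct[i] : 1 if the small model is correct on query i, else 0
    llm_correct[i] : 1 if the large model is correct on query i, else 0
    Assumed rule   : route query i to the large model when scores[i] < theta.
    """
    to_llm = [s < theta for s in scores]
    outcomes = [l if routed else s
                for routed, s, l in zip(to_llm, slm_correct, llm_correct)]
    n = len(scores)
    return sum(to_llm) / n, sum(outcomes) / n


def auroc(scores, slm_correct):
    """AUROC of the router scores against small-model correctness.

    Computed as the fraction of (correct, incorrect) pairs in which the
    correct example receives the higher score, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, slm_correct) if y]
    neg = [s for s, y in zip(scores, slm_correct) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: sweep the threshold to trace call rate vs. performance.
scores = [0.9, 0.8, 0.3, 0.2]
slm_correct = [1, 1, 0, 0]
llm_correct = [1, 1, 1, 1]
curve = [call_rate_and_perf(scores, slm_correct, llm_correct, t)
         for t in (0.0, 0.5, 1.0)]
ability = auroc(scores, slm_correct)
```

Sweeping $\theta$ over a fine grid and plotting the resulting pairs gives the call rate vs. performance curve that Figure 1 describes.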
