Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing

Yichi Zhang, Fangzheng Xie, Shu Yang, Chong Wu

TL;DR

The paper tackles the cost–quality trade-off in large language model routing by integrating scarce gold-standard (GS) data with scalable but biased preference-based (PB) data. It frames the bias between PB and GS evaluations as a conditional average treatment effect (CATE) and develops a causal meta-learning framework to debias PB data and estimate the GS quality gain function $\psi(\cdot)$. Using R- and DR-learners, the approach builds a meta-router that leverages both data sources, reduces bias, and improves routing robustness and efficiency. Empirical results on HealthBench and PRBench show that the proposed meta-router outperforms baselines, especially when GS data are limited, achieving better cost–quality trade-offs in professional domains.

Abstract

In language tasks that require extensive human–model interaction, deploying a single "best" model for every query can be expensive. To reduce inference cost while preserving the quality of the responses, a large language model (LLM) router selects the most appropriate model from a pool of candidates for each query. A central challenge in training a high-quality router is the scarcity of reliable supervision. Gold-standard data (e.g., expert-verified labels or rubric-based scores) provide accurate quality evaluations of LLM responses but are costly and difficult to scale. In contrast, preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses. We cast the problem of LLM router training with combined gold-standard and preference-based data into a causal inference framework by viewing the response evaluation mechanism as the treatment assignment. This perspective further reveals that the bias in preference-based data corresponds to a well-known causal estimand: the conditional average treatment effect. Based on this new perspective, we develop an integrative causal router training framework that corrects preference-data bias, addresses imbalances between the two data sources, and improves routing robustness and efficiency. Numerical experiments demonstrate that our approach delivers more accurate routing and improves the trade-off between cost and quality.
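To make the causal framing concrete, the following is a minimal sketch of how a DR-learner could estimate the bias between the two evaluation mechanisms and use it to debias preference-based scores. It is an illustration under stated assumptions, not the paper's implementation: queries are represented by hypothetical numeric feature vectors, both evaluation types are treated as numeric scores on a common scale, all regressions are plain least squares, the propensity of receiving a gold-standard evaluation is taken as constant, and the data are synthetic. The names `fit_linear`, `dr_learner_bias`, `psi_true`, and `tau_true` are invented for this example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def dr_learner_bias(X, y, t, n_folds=2, seed=0):
    """Cross-fitted DR-learner for tau(x) = E[y | x, t=1] - E[y | x, t=0].

    t = 1 marks gold-standard evaluations, t = 0 preference-based ones.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(X)) % n_folds
    pseudo = np.empty(len(X))
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        # Nuisance regressions are fitted on the other folds only.
        mu1 = fit_linear(X[tr & (t == 1)], y[tr & (t == 1)])
        mu0 = fit_linear(X[tr & (t == 0)], y[tr & (t == 0)])
        e = t[tr].mean()  # ASSUMPTION: constant propensity (same query distribution)
        m1, m0 = mu1(X[te]), mu0(X[te])
        # Doubly robust pseudo-outcome for the held-out fold.
        pseudo[te] = (m1 - m0
                      + t[te] * (y[te] - m1) / e
                      - (1 - t[te]) * (y[te] - m0) / (1 - e))
    # Second stage: regress the pseudo-outcomes on the query features.
    return fit_linear(X, pseudo)

# Synthetic illustration; every number below is made up.
rng = np.random.default_rng(0)
n_gs, n_pb, d = 200, 5000, 3
beta = np.array([1.0, -0.5, 0.25])

def psi_true(X):   # hypothetical true GS quality gain
    return X @ beta

def tau_true(X):   # hypothetical bias: GS mean minus PB mean
    return 0.5 + 0.3 * X[:, 0]

X_gs = rng.normal(size=(n_gs, d))
X_pb = rng.normal(size=(n_pb, d))
r = psi_true(X_gs) + 0.2 * rng.normal(size=n_gs)
y_pb = psi_true(X_pb) - tau_true(X_pb) + 0.2 * rng.normal(size=n_pb)

X = np.vstack([X_gs, X_pb])
y = np.concatenate([r, y_pb])
t = np.concatenate([np.ones(n_gs), np.zeros(n_pb)])

tau_hat = dr_learner_bias(X, y, t)   # estimated bias function
mu_pb = fit_linear(X_pb, y_pb)       # PB-only predictor

def psi_hat(Z):
    """Debiased estimate: PB prediction plus estimated bias."""
    return mu_pb(Z) + tau_hat(Z)

X_new = rng.normal(size=(1000, d))
mse_meta = np.mean((psi_hat(X_new) - psi_true(X_new)) ** 2)
mse_pb_only = np.mean((mu_pb(X_new) - psi_true(X_new)) ** 2)

# Route the queries with the largest estimated gain to the primary model.
usage_ratio = 0.3
threshold = np.quantile(psi_hat(X_new), 1 - usage_ratio)
use_primary = psi_hat(X_new) > threshold
```

In this toy setup the debiased estimate combines the large preference-based sample with the small gold-standard sample, and the final thresholding step shows one simple way a target primary-model usage ratio could be enforced. A fuller version would likely replace the constant propensity with an estimated one when the two query distributions differ, and swap the linear fits for more flexible regressors.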

Paper Structure

This paper contains 27 sections, 1 theorem, 27 equations, 9 figures.

Key Result

Lemma 1

Define $f_{\mathscr{Q}}$ and $f_{\mathscr{Q}'}$ as the density functions of $\mathscr{Q}$ and $\mathscr{Q}'$, respectively. Then procedure (proc:ros) is equivalent to procedure (proc:ros2) as follows.

Figures (9)

  • Figure 1: The efficiency gains of different routing strategies compared with the random routing baseline, plotted against the primary model usage ratio in the main numerical experiments. Subfigures correspond to varying GS sample sizes. Colors indicate different methods: oracle benchmark, meta-router via DR-learner, meta-router via R-learner, predictive router using pooled data, predictive router using GS data only, and predictive router using PB data only.
  • Figure 2: The efficiency gains of different routing strategies compared with the random routing baseline versus the primary model usage ratio. All regressions are implemented via XGBoost. Other settings are the same as in Figure 1.
  • Figure 3: The efficiency gains of different routing strategies compared with the random routing baseline versus the primary model usage ratio. The setting is the same as in Figure 1, with an additional curve corresponding to the simple debiased router through linear scaling.
  • Figure 4: The efficiency gains of different routing strategies trained and tested over PRBench in §\ref{sec:numerical:PR}. Explanations of subfigures are the same as in Figure 1.
  • Figure S1: End-to-end workflow of the meta-router. The training stage involves only the GS data $\{(q_i, r_i)\}_{i = 1}^n$ and the PB data $\{(q_i', y_i)\}_{i = 1}^m$ and can be carried out completely offline. The inference stage is based on the trained causal meta-router $\widehat{\psi}(\cdot)$ and runs online on generic incoming queries.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Remark 1
  • Remark 2
  • Lemma 1
  • Remark 3: Computational Cost