RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models

Sai Hao; Hao Zeng; Hongxin Wei; Bingyi Jing

TL;DR

This work formulates LLM routing as the $\alpha$-VOR problem, which minimizes the expected set size while controlling the misrouting risk, and proposes RACER, a method that extends base routers to output model sets whose responses can subsequently be aggregated for improved output.

Abstract

Efficiently routing queries to the optimal large language model (LLM) is crucial for optimizing the cost-performance trade-off in multi-model systems. However, most existing routers rely on single-model selection, making them susceptible to misrouting. In this work, we formulate LLM routing as the $\alpha$-VOR problem, which minimizes the expected set size while controlling the misrouting risk, and propose RACER, a novel method that extends base routers to output model sets whose responses can subsequently be aggregated for improved output. In particular, RACER constructs nested model sets via augmented scoring and uses finite-sample concentration bounds to calibrate a threshold that allows for both variable set sizes and abstention. We theoretically prove that RACER achieves rigorous distribution-free risk control on unseen test data in a post-hoc and model-agnostic manner. Extensive experiments verify our theoretical guarantees and demonstrate that RACER consistently enhances downstream accuracy across a wide range of benchmarks.
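The calibration step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's stated procedure: the sets are formed by thresholding per-model non-conformity scores, the loss is a 0/1 indicator that the set contains no acceptable model, the concentration bound is Hoeffding's inequality, and the names `prediction_set`, `empirical_risk`, `calibrate_threshold`, and `delta` are hypothetical. The null model used for abstention is omitted for brevity.

```python
import math


def prediction_set(scores, lam):
    """Indices of models whose non-conformity score is at most lam."""
    return {m for m, s in enumerate(scores) if s <= lam}


def empirical_risk(cal_scores, cal_good, lam):
    """Fraction of calibration queries whose set misses every acceptable model."""
    misses = sum(
        1
        for scores, good in zip(cal_scores, cal_good)
        if not (prediction_set(scores, lam) & good)
    )
    return misses / len(cal_scores)


def calibrate_threshold(cal_scores, cal_good, alpha, delta):
    """Smallest candidate lam whose Hoeffding upper bound on the risk is <= alpha.

    Assumption: a Hoeffding bound for a loss in [0, 1]; the paper only says
    "finite-sample concentration bounds". Returns None if no candidate works.
    """
    n = len(cal_scores)
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    # Sets only change at observed score values, so those are the candidates.
    candidates = sorted({s for scores in cal_scores for s in scores})
    for lam in candidates:
        if empirical_risk(cal_scores, cal_good, lam) + slack <= alpha:
            return lam
    return None
```

Because the sets grow with the threshold, the empirical risk can only decrease as the threshold increases, so scanning candidates in ascending order returns the threshold with the smallest sets among those that pass the bound.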

Paper Structure (63 sections, 5 theorems, 46 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Background
Preliminaries
Multi-model routing.
Problem formulation
RACER
Augmented scoring and set construction
Risk calibration
Loss function.
Threshold selection.
Inference and response aggregation
Majority voting.
Weighted aggregation.
Theoretical analysis
Nestedness and monotonicity
...and 48 more sections

Key Result

Lemma 4.1

For any query $\bm{x}\in\mathcal{X}$, the prediction model sets $\{C_\lambda(\bm{x})\}_{\lambda\in\mathbb{R}}$ given by the prediction-set definition form a nested family. That is, for any $\lambda_1\le \lambda_2$, $C_{\lambda_1}(\bm{x}) \subseteq C_{\lambda_2}(\bm{x})$.
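The nestedness property has a one-line justification under a common construction. The following is a sketch that assumes the sets are obtained by thresholding a score $s(\bm{x}, m)$ over an augmented model space $\mathcal{M}'$; the exact definition is not reproduced on this page, so the symbols $s$ and $\mathcal{M}'$ are assumptions.

```latex
% Assumed form of the prediction set (not confirmed by this page):
C_\lambda(\bm{x}) = \{\, m \in \mathcal{M}' : s(\bm{x}, m) \le \lambda \,\}.

% For \lambda_1 \le \lambda_2 and any m \in C_{\lambda_1}(\bm{x}):
s(\bm{x}, m) \le \lambda_1 \le \lambda_2
\;\Longrightarrow\; m \in C_{\lambda_2}(\bm{x}),
\qquad\text{hence}\qquad
C_{\lambda_1}(\bm{x}) \subseteq C_{\lambda_2}(\bm{x}).
```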

Figures (8)

Figure 1: Overview of the RACER paradigm. RACER operates in two phases. Risk Calibration (Left): The calibration module uses a labeled dataset $\mathcal{D}_{\mathrm{cal}}$ and a user-specified risk level $\alpha$. It augments the standard model space $\mathcal{M}$ with a null model $m_\emptyset$ to construct augmented ground truth set $G'$. The threshold $\hat{\lambda}$ is then computed to guarantee risk control. Model Routing (Right): During inference, the paradigm applies the calibrated $\hat{\lambda}$ to the augmented scores of a test query $\bm{x}$. This generates a prediction set $C_{\hat{\lambda}}(\bm{x})$. If the set contains only the null model, the system triggers abstention; otherwise, it proceeds to Response Aggregation, where the outputs of the selected standard LLMs are combined via majority voting or weighted aggregation to produce the final prediction $\hat{y}$.
Figure 2: Distributions of risk and size for RACER on CMMLU over 100 independent trials with a target risk level $\alpha=0.1$. Left: The distribution of risk, where the black dashed line represents the user-specified risk level. Results demonstrate that RACER consistently maintains the risk below the target $\alpha$ for all base routers and non-conformity scores. Right: The distribution of prediction set size. The green and orange boxes represent the router score-gap and inverse probability non-conformity scores, respectively.
Figure 3: Trade-off between computational efficiency and performance. The scatter plot illustrates the reduction in inference overhead (Model Calls Saved) versus the absolute improvement in test accuracy (Accuracy Gain) for RACER relative to the full model ensemble. The concentration of data points in the upper-right region indicates that RACER effectively filters noise to achieve significant savings (up to 58.6%) while simultaneously improving accuracy (up to +4.49%) across various dataset-router configurations.
Figure 4: The prompt template used for extracting model confidence. The placeholders {question} and {answer} are replaced by the input query $\bm{x}_{n+1}$ and the model's generated response $a_m$, respectively.
Figure 5: Distributions of empirical Risk and Size over 100 independent trials with a target risk level $\alpha=0.1$. The top row displays the risk distribution, where the vertical black dashed line indicates the target risk level $\alpha$. The bottom row illustrates the distribution of prediction set sizes. The green bars represent the router score-gap non-conformity score, while the orange bars represent the inverse probability non-conformity score. The results empirically demonstrate that RACER strictly controls the risk around the target level across different base routers and benchmarks.
...and 3 more figures
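The inference path in the Figure 1 caption (abstain when only the null model is selected, otherwise aggregate by majority voting) can be sketched as follows. This is an illustration under assumptions: the identifier `NULL_MODEL`, the function `aggregate`, the dictionary of per-model answers, and the tie-breaking rule are all hypothetical, and weighted aggregation is not shown.

```python
from collections import Counter

NULL_MODEL = "m_null"  # hypothetical identifier for the null model


def aggregate(prediction_set, answers):
    """Majority vote over the selected standard models, or None to abstain.

    prediction_set: set of model names, possibly containing NULL_MODEL.
    answers: dict mapping each standard model name to its answer.
    """
    # Sort for a deterministic order; the null model never votes.
    selected = sorted(m for m in prediction_set if m != NULL_MODEL)
    if not selected:
        return None  # only the null model (or nothing) was selected: abstain
    votes = Counter(answers[m] for m in selected)
    # Ties resolve to the answer first seen in sorted model order
    # (an arbitrary choice made for this sketch).
    return votes.most_common(1)[0][0]
```

Only the models in the calibrated set are queried, which is where the saving in model calls relative to a full ensemble would come from.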

Theorems & Definitions (13)

Definition 2.1: $\alpha$-VOR
Remark 2.2: Interpretation of validity
Lemma 4.1: Nestedness
Lemma 4.2: Monotonicity, right-continuity, and boundedness
Theorem 4.3: Risk control
Remark 4.4
Theorem 4.5: Risk lower bound
proof
proof
proof
...and 3 more
