RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models
Sai Hao, Hao Zeng, Hongxin Wei, Bingyi Jing
TL;DR
This work formulates LLM routing as the $\alpha$-VOR problem, which minimizes expected set size while controlling the misrouting risk, and proposes a novel method, RACER, that extends base routers to output model sets that can be subsequently aggregated for improved output.
Abstract
Efficiently routing queries to the optimal large language model (LLM) is crucial for optimizing the cost-performance trade-off in multi-model systems. However, most existing routers rely on single-model selection, making them susceptible to misrouting. In this work, we formulate LLM routing as the $\alpha$-VOR problem, which minimizes expected set size while controlling the misrouting risk, and propose a novel method, RACER, that extends base routers to output model sets that can be subsequently aggregated for improved output. In particular, RACER constructs nested model sets via augmented scoring and utilizes finite-sample concentration bounds to calibrate a threshold that allows for both variable set sizes and abstention. We theoretically prove that RACER achieves rigorous distribution-free risk control on unseen test data in a post-hoc and model-agnostic manner. Extensive experiments verify our theoretical guarantees and demonstrate that RACER consistently enhances downstream accuracy across a wide range of benchmarks.
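To make the idea of threshold calibration over nested model sets concrete, the following is a minimal illustrative sketch, not the RACER procedure itself. The abstract does not specify the augmented scoring function, the concentration bound, or the exact misrouting loss, so every concrete choice here is an assumption: raw router scores stand in for the augmented scores, a query counts as misrouted when its set contains no model that answers correctly (an empty set counts as a miss), a one-sided Hoeffding bound supplies the finite-sample slack, and thresholds are tested in a fixed order from the largest sets to the smallest. The names `calibrate_threshold` and `route` are hypothetical.

```python
import math
import numpy as np


def calibrate_threshold(scores, correct, alpha, delta, grid):
    """Pick the largest threshold whose misrouting risk is certified <= alpha.

    scores  : (n_queries, n_models) router scores on a held-out calibration set
    correct : (n_queries, n_models) boolean, True if that model answers correctly
    alpha   : target misrouting risk
    delta   : allowed failure probability of the certificate
    grid    : candidate thresholds
    Returns None when no threshold in the grid can be certified.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    n = scores.shape[0]
    # One-sided Hoeffding slack for a [0, 1]-bounded loss (assumed bound).
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))

    chosen = None
    # Ascending thresholds give nested sets that shrink step by step.
    for t in sorted(grid):
        in_set = scores >= t
        # Assumed loss: the set holds no correct model (empty sets included).
        miss = ~np.any(in_set & correct, axis=1)
        if miss.mean() + slack <= alpha:
            chosen = t  # certified; try a tighter threshold next
        else:
            break  # fixed-sequence testing stops at the first failure
    return chosen


def route(score_row, threshold):
    """Return the indices of selected models; an empty list means abstain."""
    return [int(i) for i in np.flatnonzero(np.asarray(score_row) >= threshold)]
```

Under these assumptions, a larger threshold yields smaller sets at the price of higher empirical risk, and the slack term shrinks as the calibration set grows, so more data permits tighter sets at the same $\alpha$.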
