Universal Model Routing for Efficient LLM Inference

Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar

TL;DR

This work addresses the challenge of routing prompts among a dynamic set of unseen LLMs to minimize inference cost without retraining routers. It introduces UniRoute, a universal routing framework that represents each LLM by a prediction-error-based feature vector and couples it with a prompt representation to learn a cost-aware routing rule that generalizes to unseen models. Two concrete instantiations are proposed: a cluster-based LLM representation (with unsupervised K-means and a learned cluster map) and a general plug-in routing approach using a prediction-error vector; both come with theoretical excess-risk guarantees. Empirical results across diverse benchmarks show UniRoute can effectively route over 30 unseen LLMs, achieving favorable deferral curves and robust performance with limited validation data. The approach offers practical, low-overhead deployment for evolving LLM ecosystems where new models appear frequently.

Abstract

Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.

Paper Structure

This paper contains 53 sections, 13 theorems, 54 equations, 8 figures, 2 tables.

Key Result

Proposition 1

Under a mild regularity condition on $\mathbb{P}$, for any input $\boldsymbol{x} \in \mathscr{X}$, LLM candidate set $\mathscr{H} \in \mathbb{H}$, and budget $B > 0$, there exists a Lagrange multiplier $\lambda_{\mathfrak{H}} \ge 0$ such that the optimal dynamic router $r^{*}$ for the constrained op

Figures (8)

  • Figure 1: Illustration of our proposed UniRoute with a cluster-based router (see §\ref{sec:cluster_router}). We first perform $K$-means on a training set to find $K$ centroids, and then partition the validation set into $K$ representative clusters. Each test-time LLM can then be represented as a $K$-dimensional feature vector of per-cluster errors. This yields an intuitive routing rule: for each test prompt, we route to the LLM with the smallest cost-adjusted average error on the cluster the prompt belongs to. The prompt embedder may either be completely unsupervised (as shown in the figure), or fitted via supervised learning using labels from a set of training LLMs different from those seen during test time (§\ref{sec:two_tower}).
  • Figure 2: Top: We report the area under the deferral curve (up to $50\%$ and $100\%$ cost), and the Quality-Neutral Cost (QNC), i.e., the minimum relative cost to achieve the same performance as the most accurate LLM. For Math+Code, we do not have training LLMs; so we do not report results for UniRoute (LearnedMap). $^{*}$ indicates the method is statistically significantly worse than UniRoute (LearnedMap) at significance level $\alpha=0.01$ (we compare against K-means for Math+Code). MLP (Clairvoyant) is an oracle that uses the test LLMs for training (provides a performance upper bound). Bottom: Areas under the deferral curve ($\uparrow$) with $96\%$ CI on unseen test LLMs for varying number of validation samples. UniRoute ($K$-means) consistently outperforms $K$-NN for small sample sizes.
  • Figure 3: Deferral curves on EmbedLLM.
  • Figure 4: Validation performance of four methods considered in \ref{fig:three_experiments} and \ref{sec:chatbot_exp_details}: $K$-NN, UniRoute ($K$-Means), UniRoute ($K$-Means Attributes), and UniRoute (LearnedMap). See \ref{sec:validate_k} for more details.
  • Figure 5: Deferral curves and router evaluation metrics (Area (50%), Area, and QNC) for different methods in the dynamic pool setting. MLP (Hu:2024b) and MatFac (OngAlmWu2024; ZhuWuWen2024) are oracle methods that observe testing LLMs during training. ZeroRouter (HuBieLi2024) and $K$-NN (HuBieLi2024; Shnitzer:2023) are baselines applicable to the dynamic LLM pool setting.
  • ...and 3 more figures
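
The cluster-based router described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `llm_feature_vector`, `route`), the toy 2-D embeddings, the 0/1 error labels, the per-LLM costs, and the trade-off weight `lam` are all hypothetical. The cost adjustment is assumed to take the additive form "per-cluster error plus `lam` times cost", and empty clusters fall back to the LLM's overall validation error.

```python
import math

def nearest_centroid(x, centroids):
    """Index of the centroid closest to embedding x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm with a deterministic init (first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_centroid(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def llm_feature_vector(val_embeddings, val_errors, centroids):
    """K-dimensional vector of mean validation error per cluster for one LLM."""
    k = len(centroids)
    sums, counts = [0.0] * k, [0] * k
    for x, err in zip(val_embeddings, val_errors):
        j = nearest_centroid(x, centroids)
        sums[j] += err
        counts[j] += 1
    overall = sum(val_errors) / len(val_errors)
    # Assumption of this sketch: empty clusters fall back to the overall error.
    return [sums[j] / counts[j] if counts[j] else overall for j in range(k)]

def route(x, centroids, llm_features, llm_costs, lam):
    """Pick the LLM with the smallest cost-adjusted error on x's cluster."""
    j = nearest_centroid(x, centroids)
    return min(llm_features, key=lambda name: llm_features[name][j] + lam * llm_costs[name])

# Toy data (hypothetical): 2-D prompt embeddings forming two clear groups.
train = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
centroids = kmeans(train, k=2)

# Validation prompts and per-LLM 0/1 errors (1 = wrong answer).
val = [(0.0, 0.1), (0.1, 0.1), (1.0, 0.9), (0.9, 0.9)]
errors = {"small": [0, 0, 1, 1], "large": [0, 0, 0, 1]}
costs = {"small": 0.1, "large": 1.0}

# A new LLM only needs its errors on the validation set; no router retraining.
features = {name: llm_feature_vector(val, errs, centroids) for name, errs in errors.items()}

easy_choice = route((0.05, 0.0), centroids, features, costs, lam=0.2)
hard_choice = route((0.95, 1.0), centroids, features, costs, lam=0.2)
hard_choice_costly = route((0.95, 1.0), centroids, features, costs, lam=1.0)
print(easy_choice, hard_choice, hard_choice_costly)
```

In this toy setup, prompts in the first cluster go to the cheaper model because both models have the same error there, while prompts in the second cluster go to the larger model unless `lam` weights cost heavily. Adding another LLM at test time only requires computing one more entry of `features` from its validation errors.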

Theorems & Definitions (24)

  • Proposition 1: Optimal dynamic routing
  • Proposition 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • ...and 14 more