Leveraging Uncertainty Estimation for Efficient LLM Routing

Tuo Zhang, Asal Mehradfar, Dimitrios Dimitriadis, Salman Avestimehr

TL;DR

This work introduces a Confidence-Driven LLM Router that uses semantic entropy (SE), a meaning-level uncertainty measure, to decide when to offload queries to cloud LLMs in edge-cloud deployments. By clustering paraphrased outputs and framing routing as an uncertainty-minimization problem, the method avoids relying on brittle human-preference data or purely accuracy-based signals. The training pipeline generates SE-based preference data in three phases and then trains lightweight routers. Evaluated against state-of-the-art baselines on MT-Bench, GSM8K, and MMLU, the approach improves response quality (as judged by LLMs) and reduces system cost. The authors point to multi-modal routing and latency-aware deployment as future directions.
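To make the idea of semantic entropy concrete, here is a minimal sketch, not the authors' implementation. It assumes that several responses have already been sampled from a model and grouped into meaning clusters by some external step (the cluster labels are taken as given), and it estimates entropy from the empirical cluster frequencies. The function names `semantic_entropy` and `prefer_label`, the `margin` parameter, and the rule of preferring the model with lower entropy are illustrative assumptions.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids: one label per sampled response; responses that share a
    label are assumed to express the same meaning. How the labels are
    produced is outside this sketch.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled response")
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def prefer_label(se_weak, se_strong, margin=0.0):
    """Hypothetical preference rule: keep the weak (edge) model unless the
    strong (cloud) model is less uncertain by more than `margin`."""
    return "weak" if se_weak <= se_strong + margin else "strong"


if __name__ == "__main__":
    # Five sampled answers that all mean the same thing: no uncertainty.
    confident = semantic_entropy(["a", "a", "a", "a", "a"])
    # Five sampled answers spread over three meanings: higher uncertainty.
    unsure = semantic_entropy(["a", "b", "b", "c", "a"])
    print(confident, unsure, prefer_label(unsure, confident))
```

Labels produced this way could serve as training targets for a lightweight router, which is the role the TL;DR assigns to SE-based preference data. The exact labeling rule used in the paper may differ from the one assumed here.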

Abstract

Deploying large language models (LLMs) in edge-cloud environments requires an efficient routing strategy to balance cost and response quality. Traditional approaches prioritize either human-preference data or accuracy metrics from benchmark datasets as routing criteria, but these methods suffer from rigidity and subjectivity. Moreover, existing routing frameworks primarily focus on accuracy and cost, neglecting response quality from a human preference perspective. In this work, we propose the Confidence-Driven LLM Router, a novel framework that leverages uncertainty estimation to optimize routing decisions. To comprehensively assess routing performance, we evaluate both system cost efficiency and response quality. In particular, we introduce the novel use of LLM-as-a-Judge to simulate human rating preferences, providing the first systematic assessment of response quality across different routing strategies. Extensive experiments on MT-Bench, GSM8K, and MMLU demonstrate that our approach outperforms state-of-the-art routing methods, achieving superior response quality while maintaining cost efficiency.

Paper Structure

This paper contains 19 sections, 12 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Performance of the human preference-based router with varying training sample sizes. Routing efficiency actually worsens as the number of training samples increases, indicating that additional data does not necessarily improve performance.
  • Figure 2: Routing performance/cost trade-off between the strong model (GPT-4) and the weak model (Mixtral-8x7B). All routers shown, except the random router, use the same kNN-based model architecture.