LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing

Yang Li

TL;DR

This paper addresses cost-aware LLM deployment by framing model selection as a multi-objective contextual routing problem. It introduces model identity vectors learned via a variational IRT framework and a preference-conditioned routing policy that generalizes across arbitrary sets of models and user costs, enabling dynamic trade-offs at inference time. The approach supports efficient onboarding of new models (20–50 prompts) and demonstrates up to 27% cost reductions while maintaining or improving accuracy across multiple benchmarks. With a scalable, permutation-invariant routing mechanism and robust generalization techniques, the framework offers a practical pathway to deploy diverse LLM ecosystems in cost-sensitive real-world settings.
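To make the inference-time trade-off concrete, here is a minimal sketch of preference-conditioned selection. It is not the paper's learned routing policy: it assumes a per-query quality estimate and a normalized cost are already available for each candidate, and it combines them with a hand-written linear scalarization. The names `Candidate`, `route`, and `quality_weight`, along with all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predicted_score: float  # assumed quality estimate for this query, in [0, 1]
    cost: float             # assumed normalized cost, in [0, 1]


def route(candidates, quality_weight):
    """Return the candidate maximizing a linear quality/cost scalarization.

    quality_weight = 1.0 ignores cost; quality_weight = 0.0 ignores quality.
    """
    if not 0.0 <= quality_weight <= 1.0:
        raise ValueError("quality_weight must lie in [0, 1]")
    cost_weight = 1.0 - quality_weight

    def utility(c):
        return quality_weight * c.predicted_score - cost_weight * c.cost

    return max(candidates, key=utility)


# Hypothetical pool: a cheap, weaker model and an expensive, stronger one.
pool = [
    Candidate("small-model", predicted_score=0.6, cost=0.1),
    Candidate("large-model", predicted_score=0.9, cost=0.9),
]

quality_first = route(pool, quality_weight=1.0)  # cost is ignored
cost_aware = route(pool, quality_weight=0.2)     # cost dominates
print(quality_first.name, cost_aware.name)
```

Changing only `quality_weight` changes which model is selected for the same query, which is the behavior the preference-conditioned policy is meant to expose to users at inference time.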

Abstract

The rapid advancement in large language models (LLMs) has brought forth a diverse range of models with varying capabilities that excel in different tasks and domains. However, selecting the optimal LLM for user queries often involves a challenging trade-off between accuracy and cost, a problem exacerbated by the diverse demands of individual queries. In this work, we present a novel framework that formulates the LLM selection process as a multi-armed bandit problem, enabling dynamic and intelligent routing of queries to the most appropriate model. Our approach incorporates a preference-conditioned dynamic routing mechanism, allowing users to specify their preferences at inference time, thereby offering a customizable balance between performance and cost. Additionally, our selection policy is designed to generalize to unseen LLMs, ensuring adaptability to new models as they emerge. Experimental results demonstrate that our method achieves significant improvements in both accuracy and cost-effectiveness across various LLM platforms, showcasing the potential of our framework to adaptively optimize LLM selection in real-world scenarios.

Paper Structure

This paper contains 39 sections, 2 theorems, 23 equations, 5 figures, 5 tables.

Key Result

Theorem 1.1

If the policy $\pi_\theta(k|x)$ is continuous in $\theta$ for all $x$ and $k$, then the expected reward $J(\theta) = \mathbb{E}_{x \sim p(x),\, k \sim \pi_\theta(\cdot|x)}[s(x,k)]$ is continuous in $\theta$.
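
The following is a sketch of why such a statement holds, not the paper's proof. It assumes a finite set of $K$ candidate models and a bounded score, $|s(x,k)| \le B$; neither assumption is stated in the theorem as quoted here. Writing the expectation over $k$ as a finite sum, for any sequence $\theta_n \to \theta$:

$$
\begin{aligned}
J(\theta) &= \mathbb{E}_{x \sim p(x)}\Big[\sum_{k=1}^{K} \pi_\theta(k|x)\, s(x,k)\Big], \\
|J(\theta_n) - J(\theta)| &\le \mathbb{E}_{x \sim p(x)}\Big[\sum_{k=1}^{K} \big|\pi_{\theta_n}(k|x) - \pi_\theta(k|x)\big|\, |s(x,k)|\Big].
\end{aligned}
$$

For each fixed $x$, the inner sum tends to $0$ because every $\pi_\theta(k|x)$ is continuous in $\theta$, and it is bounded by $2B$ because two probability vectors differ by at most $2$ in $\ell_1$ norm. Dominated convergence then gives $J(\theta_n) \to J(\theta)$.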

Figures (5)

  • Figure 1: Overview of our preference-conditioned dynamic routing framework. Model quizzing (left) generates identity vectors capturing model capabilities, while the routing policy (right) determines model selection based on user preferences and the query.
  • Figure 2: Routing performance across 5 datasets and various sets of LLM candidates.
  • Figure 3: Routing performance on two sets of new models. The identity vectors are obtained using 10, 20, or 50 selected prompts, respectively.
  • Figure 4: Ablation studies on the routing policy components.
  • Figure 5.1: Performance-cost trade-off on the MTBench dataset.

Theorems & Definitions (3)

  • Theorem 1.1
  • proof
  • Corollary 1.2