
Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C. S. Lui

TL;DR

C2MAB-V addresses online, cost-aware selection of multiple LLMs by casting the problem as a combinatorial bandit with versatile reward models (AWC, SUC, AIC) under a long-term budget. It uses a two-tier architecture (local server for online estimates and cloud for solving a relaxed continuous problem and discretization rounding) to tackle NP-hard combinatorial selection, offering sublinear regret and diminishing budget violations. The framework is theoretically grounded and empirically validated with nine LLMs over three task types, demonstrating favorable reward-cost trade-offs and rapid convergence. This approach enables scalable, task-structured, cost-conscious LLM orchestration suitable for real-world deployment with privacy-preserving, online adaptation capabilities.

Abstract

With the rapid advancement of large language models (LLMs), the diversity of multi-LLM tasks and the variability in their pricing structures have become increasingly important, as costs can vary greatly between different LLMs. To tackle these challenges, we introduce C2MAB-V, a Cost-effective Combinatorial Multi-armed Bandit with Versatile reward models for optimal LLM selection and usage. This online model differs from traditional static approaches or those reliant on a single LLM without cost consideration. With multiple LLMs deployed on a scheduling cloud and a local server dedicated to handling user queries, C2MAB-V facilitates the selection of multiple LLMs over a combinatorial search space, specifically tailored for various collaborative task types with different reward models. Based on our designed online feedback mechanism and confidence bound technique, C2MAB-V can effectively address the multi-LLM selection challenge by managing the exploration-exploitation trade-off across different models, while also balancing cost and reward for diverse tasks. The NP-hard integer linear programming problem for selecting multiple LLMs with trade-off dilemmas is addressed by: i) decomposing the integer problem into a relaxed form by the local server, ii) utilizing a discretization rounding scheme that provides optimal LLM combinations by the scheduling cloud, and iii) continual online updates based on feedback. Theoretically, we prove that C2MAB-V offers strict guarantees over versatile reward models, matching state-of-the-art results for regret and violations in some degenerate cases. Empirically, we show that C2MAB-V effectively balances performance and cost-efficiency with nine LLMs for three application scenarios.
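
The three steps listed in the abstract (confidence-bound estimates, a relaxed continuous problem, and rounding to a discrete LLM combination) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `ArmStats`, `relax_select`, and `round_selection` are hypothetical, the confidence radius is a generic Hoeffding-style choice rather than the paper's $\rho_{t,\mu_k}$, the reward is assumed to be additive across selected LLMs, and independent randomized rounding stands in for the paper's discretization rounding scheme.

```python
import math
import random


class ArmStats:
    """Running reward/cost averages per LLM (illustrative only)."""

    def __init__(self, num_llms):
        self.counts = [0] * num_llms
        self.reward_mean = [0.0] * num_llms
        self.cost_mean = [0.0] * num_llms

    def update(self, k, reward, cost):
        # Incremental mean update from one round of observed feedback.
        self.counts[k] += 1
        n = self.counts[k]
        self.reward_mean[k] += (reward - self.reward_mean[k]) / n
        self.cost_mean[k] += (cost - self.cost_mean[k]) / n

    def radius(self, k, t):
        # ASSUMPTION: generic Hoeffding-style radius, not the paper's rho.
        if self.counts[k] == 0:
            return float("inf")
        return math.sqrt(1.5 * math.log(max(t, 2)) / self.counts[k])

    def optimistic(self, t):
        # Upper bound on reward, lower bound on cost, clipped to [0, 1].
        rewards = [min(1.0, m + self.radius(k, t))
                   for k, m in enumerate(self.reward_mean)]
        costs = [max(0.0, m - self.radius(k, t))
                 for k, m in enumerate(self.cost_mean)]
        return rewards, costs


def relax_select(rewards, costs, budget):
    """Relaxed problem: max sum(r*z) s.t. sum(c*z) <= budget, 0 <= z <= 1.

    With a single budget constraint and non-negative inputs, filling by
    reward-to-cost ratio solves this linear relaxation exactly.
    """
    def ratio(k):
        return rewards[k] / costs[k] if costs[k] > 0 else float("inf")

    z = [0.0] * len(rewards)
    remaining = budget
    for k in sorted(range(len(rewards)), key=ratio, reverse=True):
        if costs[k] <= remaining:
            z[k] = 1.0
            remaining -= costs[k]
        elif remaining > 0:
            z[k] = remaining / costs[k]  # fractional fill of the last item
            remaining = 0.0
    return z


def round_selection(z, rng):
    # ASSUMPTION: independent rounding; each LLM k is chosen with
    # probability z[k], so the expected cost equals the relaxed cost.
    return [k for k, zk in enumerate(z) if rng.random() < zk]


if __name__ == "__main__":
    stats = ArmStats(3)
    stats.update(0, reward=0.9, cost=0.5)
    stats.update(1, reward=0.5, cost=0.25)
    rewards, costs = stats.optimistic(t=3)
    z = relax_select(rewards, costs, budget=0.5)
    print(round_selection(z, random.Random(0)))
```

In this sketch an LLM that has never been queried receives the most optimistic reward and cost values, which forces early exploration. The budget is enforced only in expectation by the rounding step, which is one simple way to picture a long-term rather than per-round budget constraint.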

Paper Structure

This paper contains 25 sections, 8 theorems, 32 equations, 14 figures, 4 tables, 3 algorithms.

Key Result

Lemma 1

For each round $t$ and LLM $k \in \mathcal{K}$, define $\mathcal{N}_{\mu}$ as the event where $\left|\hat{\mu}_{t, k}-\mu_k\right| <\rho_{t,\mu_k}$, and $\mathcal{N}_c$ as $\left|\hat{c}_{t, k}-c_k\right|<\rho_{t,c_k}$. Then, the probability of $\mathcal{N}_{\mu},\mathcal{N}_c$ occurring is at least

Figures (14)

  • Figure 1: Accuracy of different LLMs across varied problem samples.
  • Figure 2: Simple example of combinatorial LLMs in a cascading form.
  • Figure 3: Design of the C2MAB-V workflow; detailed process descriptions are given in the accompanying main text.
  • Figure 4: Reward/violation ratio of three task types with nine different LLMs.
  • Figure 5: Sample conversation with LLM on biology, chemistry, geography, and physics.
  • ...and 9 more figures

Theorems & Definitions (11)

  • Lemma 1
  • Theorem 1: Regret Bound
  • Remark 1
  • Theorem 2: Violation Bound
  • Remark 2
  • Lemma 2: Theorem 2.1 in chekuri2009dependent
  • Lemma 3: Theorem 1 in sun2023simple
  • proof
  • Lemma 4: Subgaussian random variables
  • Lemma 5: Chernoff-Hoeffding inequality dubhashi2009concentration.
  • ...and 1 more