Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

Xiangxiang Dai; Jin Li; Xutong Liu; Anqi Yu; John C. S. Lui

TL;DR

C2MAB-V addresses online, cost-aware selection of multiple LLMs by casting the problem as a combinatorial bandit with versatile reward models (AWC, SUC, AIC) under a long-term budget. It tackles the NP-hard combinatorial selection with a two-tier architecture: a local server maintains online estimates and relaxes the integer problem into a continuous one, and a scheduling cloud applies discretization rounding to obtain the LLM combination. The method offers sublinear regret and diminishing budget violations. It is theoretically grounded and empirically validated with nine LLMs over three task types, showing favorable reward-cost trade-offs and rapid convergence. The approach targets scalable, task-structured, cost-conscious LLM orchestration with privacy-preserving online adaptation.
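The three reward models are named only by acronym in this summary. As a minimal sketch, the snippet below assumes the readings "any win" (AWC), "sum up" (SUC), and "all in" (AIC), and also assumes that each selected LLM succeeds independently with probability `mu[k]`. The function names are hypothetical and are not taken from the paper.

```python
from math import prod


def awc_reward(mu):
    """Any-win (assumed reading of AWC): success if at least one LLM succeeds."""
    return 1.0 - prod(1.0 - m for m in mu)


def suc_reward(mu):
    """Sum-up (assumed reading of SUC): individual rewards add up."""
    return sum(mu)


def aic_reward(mu):
    """All-in (assumed reading of AIC): success only if every LLM succeeds."""
    return prod(mu)


if __name__ == "__main__":
    selected = [0.5, 0.5]  # illustrative success probabilities
    print(awc_reward(selected), suc_reward(selected), aic_reward(selected))
```

Under these assumptions, adding an LLM never lowers the any-win reward and never raises the all-in reward, which is one reason the best combination depends on the task type.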

Abstract

With the rapid advancement of large language models (LLMs), the diversity of multi-LLM tasks and the variability in their pricing structures have become increasingly important, as costs can vary greatly between different LLMs. To tackle these challenges, we introduce C2MAB-V, a Cost-effective Combinatorial Multi-armed Bandit with Versatile reward models for optimal LLM selection and usage. This online model differs from traditional static approaches and from those that rely on a single LLM without cost consideration. With multiple LLMs deployed on a scheduling cloud and a local server dedicated to handling user queries, C2MAB-V facilitates the selection of multiple LLMs over a combinatorial search space, specifically tailored for various collaborative task types with different reward models. Based on our designed online feedback mechanism and confidence bound technique, C2MAB-V can effectively address the multi-LLM selection challenge by managing the exploration-exploitation trade-off across different models, while also balancing cost and reward for diverse tasks. The NP-hard integer linear programming problem for selecting multiple LLMs with trade-off dilemmas is addressed by: i) decomposing the integer problem into a relaxed form by the local server, ii) utilizing a discretization rounding scheme that provides optimal LLM combinations by the scheduling cloud, and iii) continual online updates based on feedback. Theoretically, we prove that C2MAB-V offers strict guarantees over versatile reward models, matching state-of-the-art results for regret and violations in some degenerate cases. Empirically, we show that C2MAB-V effectively balances performance and cost-efficiency with nine LLMs for three application scenarios.
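The relax, round, and update steps described in the abstract can be sketched as below. This is an illustration under stated assumptions, not the paper's algorithm: the reward is linear (sum of per-LLM rewards), the budget is a simple per-round constraint, the relaxed problem is solved as a fractional knapsack, independent randomized rounding stands in for the paper's discretization rounding scheme, and the confidence-radius constant is arbitrary. All function names are hypothetical.

```python
import math
import random


def confidence_radius(t, pulls):
    # Hoeffding-style radius; the constant 1.5 is an assumption.
    return math.sqrt(1.5 * math.log(t) / pulls) if pulls > 0 else float("inf")


def relaxed_selection(ucb_reward, lcb_cost, budget):
    """Maximize sum(r*z) s.t. sum(c*z) <= budget, 0 <= z <= 1 (fractional knapsack)."""
    z = [0.0] * len(ucb_reward)
    order = sorted(range(len(z)),
                   key=lambda k: ucb_reward[k] / max(lcb_cost[k], 1e-9),
                   reverse=True)
    remaining = budget
    for k in order:
        if remaining <= 0:
            break
        cost = max(lcb_cost[k], 1e-9)  # avoid division by zero
        z[k] = min(1.0, remaining / cost)
        remaining -= z[k] * cost
    return z


def round_selection(z, rng):
    """Independent randomized rounding: select LLM k with probability z[k]."""
    return [k for k, p in enumerate(z) if rng.random() < p]


def run(true_mu, true_cost, budget, rounds, seed=0):
    rng = random.Random(seed)
    n = len(true_mu)
    pulls, mu_hat, c_hat = [0] * n, [0.0] * n, [0.0] * n
    for t in range(1, rounds + 1):
        # Optimistic rewards and optimistic (low) costs from confidence bounds.
        ucb = [min(1.0, mu_hat[k] + confidence_radius(t, pulls[k])) for k in range(n)]
        lcb = [max(0.0, c_hat[k] - confidence_radius(t, pulls[k])) for k in range(n)]
        chosen = round_selection(relaxed_selection(ucb, lcb, budget), rng)
        for k in chosen:
            reward = 1.0 if rng.random() < true_mu[k] else 0.0  # simulated feedback
            cost = true_cost[k]  # deterministic cost, for simplicity
            pulls[k] += 1
            mu_hat[k] += (reward - mu_hat[k]) / pulls[k]  # running mean
            c_hat[k] += (cost - c_hat[k]) / pulls[k]
    return mu_hat, c_hat, pulls


if __name__ == "__main__":
    print(run(true_mu=[0.9, 0.2], true_cost=[0.5, 0.5], budget=0.6, rounds=200))
```

In the first round every LLM has an unbounded confidence radius, so all of them are tried once; afterwards the selection concentrates on LLMs with a high estimated reward per unit of estimated cost.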

Paper Structure (25 sections, 8 theorems, 32 equations, 14 figures, 4 tables, 3 algorithms)

Introduction
Related Work and Motivation
Related Work
Motivation
Problem Formulation
Algorithm Design
Procedures by Local Server
Procedures by Scheduling Cloud
Performance Analysis
Performance Evaluation
Conclusion
Description Supplement
Discretization Rounding Algorithms
Constraint and Reward Analysis
Discussions on Constraint Type and Extended Task Type
...and 10 more sections

Key Result

Lemma 1

For each round $t$ and LLM $k \in \mathcal{K}$, define $\mathcal{N}_{\mu}$ as the event where $\left|\hat{\mu}_{t, k}-\mu_k\right| <\rho_{t,\mu_k}$, and $\mathcal{N}_c$ as $\left|\hat{c}_{t, k}-c_k\right|<\rho_{t,c_k}$. Then, the probability of $\mathcal{N}_{\mu},\mathcal{N}_c$ occurring is at least

Figures (14)

Figure 1: Accuracy of different LLMs across varied problem samples.
Figure 2: Simple example of combinatorial LLMs in a cascading form.
Figure 3: Design of the C2MAB-V workflow, with detailed process descriptions provided in the main text.
Figure 4: Reward/violation ratio of three task types with nine different LLMs.
Figure 5: Sample conversation with LLM on biology, chemistry, geography, and physics.
...and 9 more figures

Theorems & Definitions (11)

Lemma 1
Theorem 1: Regret Bound
Remark 1
Theorem 2: Violation Bound
Remark 2
Lemma 2: Theorem 2.1 in chekuri2009dependent
Lemma 3: Theorem 1 in sun2023simple
proof
Lemma 4: Subgaussian random variables
Lemma 5: Chernoff-Hoeffding inequality dubhashi2009concentration.
...and 1 more
