Cost-Effective Online Multi-LLM Selection with Versatile Reward Models
Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C. S. Lui
TL;DR
C2MAB-V addresses online, cost-aware selection of multiple LLMs by casting the problem as a combinatorial bandit with versatile reward models (AWC, SUC, AIC) under a long-term budget. To tackle the NP-hard combinatorial selection, it uses a two-tier architecture: a local server maintains online estimates and relaxes the integer problem into a continuous one, and a scheduling cloud applies discretization rounding to choose the LLM combination. The method offers sublinear regret and diminishing budget violations. It is theoretically grounded and empirically validated with nine LLMs over three task types, showing favorable reward-cost trade-offs and rapid convergence. The approach supports scalable, task-specific, cost-conscious LLM orchestration with privacy-preserving online adaptation for real-world deployment.
Abstract
With the rapid advancement of large language models (LLMs), the diversity of multi-LLM tasks and the variability in their pricing structures have become increasingly important, as costs can vary greatly between different LLMs. To tackle these challenges, we introduce \textit{C2MAB-V}, a \underline{C}ost-effective \underline{C}ombinatorial \underline{M}ulti-armed \underline{B}andit with \underline{V}ersatile reward models for optimal LLM selection and usage. This online model differs from traditional static approaches and from those that rely on a single LLM without cost consideration. With multiple LLMs deployed on a scheduling cloud and a local server dedicated to handling user queries, \textit{C2MAB-V} facilitates the selection of multiple LLMs over a combinatorial search space, specifically tailored for various collaborative task types with different reward models. Based on our designed online feedback mechanism and confidence bound technique, \textit{C2MAB-V} can effectively address the multi-LLM selection challenge by managing the exploration-exploitation trade-off across different models, while also balancing cost and reward for diverse tasks. The NP-hard integer linear programming problem for selecting multiple LLMs with trade-off dilemmas is addressed by: i) decomposing the integer problem into a relaxed form on the local server, ii) utilizing a discretization rounding scheme on the scheduling cloud that provides optimal LLM combinations, and iii) performing continual online updates based on feedback. Theoretically, we prove that \textit{C2MAB-V} offers strict guarantees over versatile reward models, matching state-of-the-art results for regret and violations in some degenerate cases. Empirically, we show that \textit{C2MAB-V} effectively balances performance and cost-efficiency with nine LLMs across three application scenarios.
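The three-step loop described above (relax, round, update from feedback) can be illustrated with a toy sketch. It is not the authors' algorithm: it assumes an additive reward, a per-round budget as the only constraint, deterministic observed costs, a greedy fractional-knapsack solver standing in for the relaxed problem, and independent randomized rounding standing in for the paper's discretization rounding scheme. All function names (`confidence_radius`, `solve_relaxed`, `round_selection`, `run`) and all numeric values are hypothetical.

```python
import math
import random


def confidence_radius(t, pulls):
    """Hoeffding-style radius; infinite for an arm that has never been tried."""
    return math.sqrt(1.5 * math.log(t) / pulls) if pulls > 0 else float("inf")


def solve_relaxed(ucb_reward, lcb_cost, budget):
    """Greedy fractional knapsack (assumed stand-in for the relaxed problem):
    maximize sum(r_k * z_k) s.t. sum(c_k * z_k) <= budget, 0 <= z_k <= 1."""
    z = [0.0] * len(ucb_reward)
    order = sorted(range(len(z)),
                   key=lambda k: ucb_reward[k] / lcb_cost[k], reverse=True)
    remaining = budget
    for k in order:
        if remaining <= 0:
            break
        z[k] = min(1.0, remaining / lcb_cost[k])
        remaining -= z[k] * lcb_cost[k]
    return z


def round_selection(z, rng):
    """Independent randomized rounding (assumed stand-in): pick arm k w.p. z_k."""
    return [k for k, p in enumerate(z) if rng.random() < p]


def run(true_reward, true_cost, budget, rounds, seed=0):
    """Simulate the loop; returns per-arm pull counts and average cost per round."""
    rng = random.Random(seed)
    n_arms = len(true_reward)
    pulls = [0] * n_arms
    reward_sum = [0.0] * n_arms
    cost_sum = [0.0] * n_arms
    total_cost = 0.0
    for t in range(1, rounds + 1):
        ucb, lcb = [], []
        for k in range(n_arms):
            rad = confidence_radius(t, pulls[k])
            r_mean = reward_sum[k] / pulls[k] if pulls[k] else 0.0
            c_mean = cost_sum[k] / pulls[k] if pulls[k] else 0.0
            ucb.append(min(1.0, r_mean + rad))   # optimistic reward estimate
            lcb.append(max(1e-6, c_mean - rad))  # optimistic (low) cost estimate
        z = solve_relaxed(ucb, lcb, budget)      # step i: relaxed problem
        chosen = round_selection(z, rng)         # step ii: rounding
        for k in chosen:                         # step iii: feedback update
            reward = 1.0 if rng.random() < true_reward[k] else 0.0
            pulls[k] += 1
            reward_sum[k] += reward
            cost_sum[k] += true_cost[k]          # cost observed without noise
            total_cost += true_cost[k]
    return pulls, total_cost / rounds


if __name__ == "__main__":
    pulls, avg_cost = run([0.9, 0.5, 0.8], [0.3, 0.1, 0.4], budget=0.5, rounds=500)
    print("pulls per arm:", pulls)
    print("average cost per round:", round(avg_cost, 3))
```

In this sketch the optimistic estimates (upper bound on reward, lower bound on cost) drive exploration, so early rounds may overspend the budget while the cost estimates are still loose. The sketch does not reproduce the paper's regret or violation guarantees.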
