Which LLM to Play? Convergence-Aware Online Model Selection with Time-Increasing Bandits

Yu Xia; Fang Kong; Tong Yu; Liya Guo; Ryan A. Rossi; Sungchul Kim; Shuai Li

TL;DR

This paper addresses online model selection when model performance increases with finetuning and then converges, a pattern common in LLM deployment. It introduces TI-UCB, a two-phase time-increasing bandit algorithm that uses least-squares linear growth modeling for reward increases and a sliding-window change-detection mechanism to identify convergence points, achieving a logarithmic regret bound $\mathbb{E}[R(T)] = O\left(\frac{\ln T}{\Delta_{\min}^2}\right)$. Through extensive synthetic and real-world experiments (classification and LLM summarization with finetuning costs), TI-UCB consistently outperforms baselines by efficiently balancing exploration and exploitation while adapting to convergences. The work demonstrates the practical value of leveraging the increasing-then-converging trend for economical, scalable online model selection in real-world LLM deployments, and provides guidance on change-detection window choices. Overall, TI-UCB offers a principled, theoretically grounded approach to rapidly converge on high-performing models under expensive finetuning constraints, with robust performance across fluctuating reward landscapes.
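The least-squares linear growth modeling mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fit_line` and `predict_next_reward`, the `window` parameter, and the sample rewards are hypothetical, and the sketch only shows the general idea of fitting a line to an arm's recent rewards and extrapolating one pull ahead.

```python
def fit_line(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Hypothetical helper, not from the paper.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict_next_reward(rewards, window):
    """Extrapolate the reward of the next pull from the last `window` rewards."""
    recent = list(rewards[-window:])
    slope, intercept = fit_line(recent)
    # The fitted points sit at x = 0 .. len(recent) - 1, so the next pull is x = len(recent).
    return slope * len(recent) + intercept


# Illustrative rewards of one arm that is still improving with finetuning.
observed = [0.10, 0.20, 0.30, 0.40, 0.50]
next_reward = predict_next_reward(observed, window=4)
```

For a reward sequence that is still growing linearly, the extrapolated value lies above the latest observation, which is what lets a bandit credit an arm for improvement it has not yet shown.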

Abstract

Web-based applications such as chatbots, search engines and news recommendations continue to grow in scale and complexity with the recent surge in the adoption of LLMs. Online model selection has thus garnered increasing attention due to the need to choose the best model among a diverse set while balancing task reward and exploration cost. Organizations face decisions such as whether to employ a costly API-based LLM or a locally finetuned small LLM, weighing cost against performance. Traditional selection methods often evaluate every candidate model before choosing one, which is becoming impractical given the rising costs of training and finetuning LLMs. Moreover, it is undesirable to allocate excessive resources to exploring poorly performing models. While some recent works leverage online bandit algorithms to manage the exploration-exploitation trade-off in model selection, they tend to overlook the increasing-then-converging trend in model performance as a model is iteratively finetuned, leading to less accurate predictions and suboptimal model selections. In this paper, we propose a time-increasing bandit algorithm, TI-UCB, which effectively predicts the increase in model performance due to finetuning and efficiently balances exploration and exploitation in model selection. To further capture the converging points of models, we develop a change detection mechanism that compares consecutive increase predictions. We theoretically prove that our algorithm achieves a logarithmic regret upper bound in a typical increasing bandit setting, which implies a fast convergence rate. The advantage of our method is also empirically validated through extensive experiments on classification model selection and online selection of LLMs. Our results highlight the importance of utilizing the increasing-then-converging pattern for more efficient and economical model selection in the deployment of LLMs.
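As a rough illustration of detecting convergence by comparing consecutive increase predictions, the sketch below compares the least-squares slope over the latest window of rewards with the slope over the window just before it. This is an assumed simplification rather than the paper's exact test: `ols_slope`, `detect_convergence`, `window`, and `threshold` are hypothetical names and parameters chosen for the example.

```python
def ols_slope(ys):
    """Least-squares slope of ys against the indices 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to estimate a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    return sxy / sxx


def detect_convergence(rewards, window, threshold):
    """Flag a change when two consecutive windows disagree on the growth rate.

    Returns False until 2 * window rewards are available.
    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    if len(rewards) < 2 * window:
        return False
    previous = rewards[-2 * window:-window]
    latest = rewards[-window:]
    return abs(ols_slope(previous) - ols_slope(latest)) > threshold


# An arm that improves for five pulls and then plateaus (illustrative numbers).
still_rising = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
plateaued = [0.0, 0.1, 0.2, 0.3, 0.4, 0.4, 0.4, 0.4]

rising_flag = detect_convergence(still_rising, window=4, threshold=0.05)
plateau_flag = detect_convergence(plateaued, window=4, threshold=0.05)
```

In this toy setup only the plateaued sequence triggers the detector. A larger window smooths noise but reacts later, which is the kind of trade-off behind the window choices the summary mentions.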

Paper Structure (45 sections, 8 theorems, 34 equations, 6 figures, 1 algorithm)

Introduction
Related Work
Online Model Selection
Non-Stationary Bandits
Problem Formulation
Online Model Selection
Model Selection
Reward Observation
Model Finetuning
Time-Increasing Bandits
Proposed Method
TI-UCB Algorithm
Increase Prediction
Change Detection
Regret Upper Bound of TI-UCB
...and 30 more sections

Key Result

Proposition 1

The upper confidence bound in TI-UCB for arm $i$ can be defined as […]. Then for any $\delta\in(0, 1)$, $\mu \leq \hat{\mu} + 16\sqrt{\frac{2\ln(1/\delta)}{n}}$ holds with probability at least $1-\delta$. A detailed proof is provided in the paper's appendix.
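To get a feel for the size of the confidence term $16\sqrt{\frac{2\ln(1/\delta)}{n}}$, the short sketch below evaluates it numerically. The function name `confidence_radius` is hypothetical, and $n$ is assumed here to be the number of observations behind the estimate $\hat{\mu}$, since this summary does not define it.

```python
import math


def confidence_radius(n, delta):
    """Evaluate 16 * sqrt(2 * ln(1 / delta) / n) from Proposition 1.

    n     -- assumed to be the number of observations used for the estimate
    delta -- failure probability, strictly between 0 and 1
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return 16 * math.sqrt(2 * math.log(1 / delta) / n)


# The radius shrinks like 1 / sqrt(n): four times the data halves the width.
radius_100 = confidence_radius(100, delta=0.05)
radius_400 = confidence_radius(400, delta=0.05)
```

Because the term scales as $1/\sqrt{n}$, quadrupling the number of observations halves the radius, while lowering $\delta$ (asking for higher confidence) widens it only logarithmically.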

Figures (6)

Figure 1: An illustrative example of online model selection for LLM summarization.
Figure 2: Increasing-then-converging reward trends of an API-based LLM (GPT-3 Davinci) and a local small LLM (GPT-2 Medium) over finetuning steps on a text summarization dataset. The reward considers both model performance and finetuning cost, as detailed in the paper. GPT-2 Medium is observed to outperform GPT-3 Davinci after a certain number of finetuning steps, and such reward trends make it non-trivial to apply existing methods for online model selection.
Figure 3: Online selection of generated synthetic models covering a variety of increasing-then-converging patterns.
Figure 4: Online selection of canonical classification models on the IMDB dataset.
Figure 5: Online selection of large language models on the XSum dataset for summarization.
...and 1 more figure
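The crossover described in the Figure 2 caption, where a model that starts weaker ends up ahead after enough finetuning, can be mimicked with a toy simulation. Everything in it is hypothetical: the piecewise-linear reward shape, the arm parameters (expressed as integer percentage points), and the names `reward` and `first_crossover` are illustrative choices, not data or code from the paper.

```python
def reward(start, slope, plateau, step):
    """Increasing-then-converging reward: linear growth capped at a plateau."""
    return min(start + slope * step, plateau)


# Rewards in integer percentage points (hypothetical values).
arm_a = {"start": 50, "slope": 1, "plateau": 60}  # strong at first, converges early
arm_b = {"start": 20, "slope": 2, "plateau": 80}  # weak at first, converges higher


def first_crossover(a, b, horizon):
    """First finetuning step at which arm b strictly beats arm a, or None."""
    for step in range(horizon):
        if reward(step=step, **b) > reward(step=step, **a):
            return step
    return None


crossover_step = first_crossover(arm_a, arm_b, horizon=100)
```

With these numbers the second arm only pulls ahead after the first has already plateaued, so a selector that judges arms by their early rewards alone would settle on the arm that is worse in the long run.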

Theorems & Definitions (8)

Proposition 1
Proposition 2
Theorem 1
Lemma 1
Lemma 2
Lemma 3
Lemma 4: Regret bound for $F_1^{c}$
Lemma 5: Regret bound from $\nu_1$ to $T$
