Large Language Model-Enhanced Multi-Armed Bandits

Jiahang Sun, Zhiyong Wang, Runhan Yang, Chenjun Xiao, John C. S. Lui, Zhongxiang Dai

TL;DR

This work addresses the suboptimality of direct LLM-based arm selection in multi-armed bandits by embedding LLMs as reward predictors within classical MAB frameworks. It introduces three methods: TS-LLM, which pairs Thompson sampling with a decaying LLM temperature to move from exploration to exploitation; RO-LLM, which uses the LLM (at temperature 0) as the reward predictor in a regression oracle-based algorithm with an explicit exploration mechanism; and TS-LLM-DB, which extends the TS-based approach to dueling bandits, where only preference feedback between pairs of arms is available. Empirical results on synthetic tasks and real-world text datasets show consistent improvements over baselines that rely on direct LLM arm selection, including in challenging tasks where the arms lack semantic meaning that the LLM can exploit. The approach combines the prediction strengths of LLMs with the principled exploration mechanisms of classical bandit algorithms.

Abstract

Large language models (LLMs) have been adopted to solve sequential decision-making tasks such as multi-armed bandits (MAB), in which an LLM is directly instructed to select the arms to pull in every iteration. However, this paradigm of direct arm selection using LLMs has been shown to be suboptimal in many MAB tasks. Therefore, we propose an alternative approach which combines the strengths of classical MAB and LLMs. Specifically, we adopt a classical MAB algorithm as the high-level framework and leverage the strong in-context learning capability of LLMs to perform the sub-task of reward prediction. Firstly, we incorporate the LLM-based reward predictor into the classical Thompson sampling (TS) algorithm and adopt a decaying schedule for the LLM temperature to ensure a transition from exploration to exploitation. Next, we incorporate the LLM-based reward predictor (with a temperature of 0) into a regression oracle-based MAB algorithm equipped with an explicit exploration mechanism. We also extend our TS-based algorithm to dueling bandits where only the preference feedback between pairs of arms is available, which requires non-trivial algorithmic modifications. We conduct empirical evaluations using both synthetic MAB tasks and experiments designed using real-world text datasets, in which the results show that our algorithms consistently outperform previous baseline methods based on direct arm selection. Interestingly, we also demonstrate that in challenging tasks where the arms lack semantic meanings that can be exploited by the LLM, our approach achieves considerably better performance than LLM-based direct arm selection.
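
The TS-based idea in the abstract (a classical bandit loop on the outside, a reward predictor with a decaying temperature on the inside) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `mock_llm_predict` is a hypothetical stand-in that replaces the LLM with an empirical mean plus temperature-scaled Gaussian noise, the exponential form of `temperature_schedule` is assumed (the abstract only says the schedule decays), and the arms are assumed to give Bernoulli rewards.

```python
import math
import random


def temperature_schedule(t, t0=1.0, decay=0.05):
    """Hypothetical decaying temperature (exponential form is an assumption)."""
    return t0 * math.exp(-decay * t)


def mock_llm_predict(history, arm, temperature, rng):
    """Hypothetical stand-in for an LLM reward predictor.

    A real system would prompt an LLM with the (arm, reward) history as
    in-context examples. Here the prediction is the arm's empirical mean
    (0.5 if the arm is unseen) plus Gaussian noise scaled by the temperature,
    so a higher temperature gives more varied predictions.
    """
    rewards = [r for a, r in history if a == arm]
    mean = sum(rewards) / len(rewards) if rewards else 0.5
    return mean + rng.gauss(0.0, temperature)


def run_ts_style_bandit(true_means, horizon, seed=0):
    """Bandit loop: predict a reward per arm, pull the arm with the largest one."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    history = []  # list of (arm, reward) pairs observed so far
    for t in range(horizon):
        temp = temperature_schedule(t)
        preds = [mock_llm_predict(history, a, temp, rng) for a in range(n_arms)]
        arm = max(range(n_arms), key=preds.__getitem__)
        # Assumed Bernoulli reward drawn from the chosen arm's true mean.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        history.append((arm, reward))
    return history


if __name__ == "__main__":
    hist = run_ts_style_bandit([0.2, 0.5, 0.8], horizon=200)
    pulls = [sum(1 for a, _ in hist if a == k) for k in range(3)]
    print("pulls per arm:", pulls)
```

Early on, the high temperature makes the predictions noisy, so different arms get tried; as the temperature decays, the predictions settle near the observed means and the loop increasingly exploits. In this sketch the randomness of the predictions plays the role that posterior sampling plays in standard Thompson sampling.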

Paper Structure

This paper contains 25 sections, 2 equations, 7 figures, 3 algorithms.

Figures (7)

  • Figure 1: The performance of our TS-LLM and RO-LLM algorithms in classical stochastic MAB tasks.
  • Figure 2: The performance of our TS-LLM-DB algorithm in dueling bandits with linear and square latent reward functions.
  • Figure 3: The cumulative rewards in the text experiments using the OneShotWikiLinks and AmazonCat-13K datasets (Sec. \ref{subsec:exp:text}).
  • Figure 4: The cumulative rewards in the text experiments using the AmazonCat-13K dataset with $K=30$ arms.
  • Figure 5: The performance of our TS-LLM algorithm in stochastic MAB tasks with different temperatures.
  • ...and 2 more figures