
Jump Starting Bandits with LLM-Generated Prior Knowledge

Parand A. Alamdari, Yanshuai Cao, Kevin H. Wilson

TL;DR

This work proposes an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit, which significantly reduces online learning regret and data-gathering costs for training such models.

Abstract

We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
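The jump-start idea in the abstract can be illustrated with a minimal sketch. This is not the authors' CBLI implementation: the arm names, the single `age` feature, the `mock_llm_prefers` stub standing in for a real LLM call, the conversion of pairwise preferences into 0/1 rewards, and the per-arm linear model are all assumptions made for illustration only.

```python
import random

ARMS = ["email", "sms", "push"]  # hypothetical arms, not from the paper


def mock_llm_prefers(context, arm_a, arm_b, rng):
    """Stand-in for an LLM call (assumption): pick the arm a simulated user prefers.

    A real system would prompt an LLM with a user description and both options.
    """
    # Assumed toy preference: low "age" leans to push, high "age" leans to email.
    score = {"email": context["age"], "sms": 0.5, "push": 1.0 - context["age"]}
    noisy = {arm: s + rng.gauss(0, 0.1) for arm, s in score.items()}
    return arm_a if noisy[arm_a] >= noisy[arm_b] else arm_b


def generate_pretraining_data(n_users, rng):
    """Build (context, arm, reward) rows from pairwise preferences."""
    rows = []
    for _ in range(n_users):
        context = {"age": rng.random()}  # synthetic user, age scaled to [0, 1]
        for i in range(len(ARMS)):
            for j in range(i + 1, len(ARMS)):
                a, b = ARMS[i], ARMS[j]
                winner = mock_llm_prefers(context, a, b, rng)
                rows.append((context, a, 1.0 if winner == a else 0.0))
                rows.append((context, b, 1.0 if winner == b else 0.0))
    return rows


class LinearBandit:
    """Per-arm linear reward model on features [1, age], trained by SGD."""

    def __init__(self):
        self.weights = {arm: [0.0, 0.0] for arm in ARMS}

    def predict(self, context, arm):
        w = self.weights[arm]
        return w[0] + w[1] * context["age"]

    def update(self, context, arm, reward, lr=0.1):
        err = reward - self.predict(context, arm)
        w = self.weights[arm]
        w[0] += lr * err
        w[1] += lr * err * context["age"]

    def choose(self, context, rng, epsilon=0.1):
        # Epsilon-greedy: explore with probability epsilon, else exploit.
        if rng.random() < epsilon:
            return rng.choice(ARMS)
        return max(ARMS, key=lambda arm: self.predict(context, arm))


rng = random.Random(0)
data = generate_pretraining_data(n_users=200, rng=rng)

# Jump-start: fit the bandit on LLM-generated rows before any real traffic.
bandit = LinearBandit()
for context, arm, reward in data:
    bandit.update(context, arm, reward)
```

After this pre-training pass, the same `update` method would be called with real user rewards during online learning, so the bandit starts from the simulated preferences rather than from zero weights. Whether this reduces regret in practice depends on how closely the generated preferences track real ones, which is the question the paper's experiments address.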

Paper Structure

This paper contains 37 sections, 4 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Using an LLM to jump-start bandit learning: (left) pre-training using our proposed CBLI; (right) online learning of the jump-started bandit.
  • Figure 2: Accumulated regret relative to a GPT-4o-based oracle across 1,000 samples. Each line represents a CB trained using data generated by CBLI with $\mathcal{M}$ indicated in the legend. Error bars represent variance over shuffling true responses 10 times.
  • Figure 3: Accumulated regret relative to true responses from $N=1970$ responses to a conjoint experiment. Each line represents a different instantiation of CBLI with a different LLM. Error bars represent variance over shuffling true responses 10 times.
  • Figure 4: Comparison of arm scoring methods using LLMs. Each bar represents the average reward of an arm across all users. The left figure illustrates results from scoring each arm individually, while the right figure shows results from scoring arms pair-wise as per Algorithm 1 (CBLI).