Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

Adam Bayley, Xiaodan Zhu, Raquel Aoki, Yanshuai Cao, Kevin H. Wilson

Abstract

The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. For aligned domains, we find that warm-starting remains effective up to 30% corruption, loses its advantage around 40%, and degrades performance beyond 50%. When there is systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effect of random label noise and systematic misalignment on the prior error driving the bandit's regret, and derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.
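The pipeline the abstract describes can be sketched in a few lines: synthetic (LLM-style) preference labels are corrupted at rate $p$ by label flipping, then used to warm-start the ridge estimate behind a LinUCB-style bandit. This is a minimal illustrative sketch, not the paper's implementation; all names and parameters (`d`, `n_synth`, `theta_star`, `p_flip`, etc.) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and noise rate (assumptions, not the paper's values).
d, n_synth, p_flip, lam = 5, 200, 0.3, 1.0
theta_star = rng.normal(size=d)            # true user preference vector
X = rng.normal(size=(n_synth, d))          # synthetic item features
y = (X @ theta_star > 0).astype(float)     # stand-in for LLM-generated choices

# Inject label-flipping noise at rate p_flip.
flip = rng.random(n_synth) < p_flip
y_noisy = np.where(flip, 1.0 - y, y)

# Warm start: ridge regression on the (noisy) synthetic data gives the
# design matrix and coefficient estimate that initialize LinUCB.
V0 = lam * np.eye(d) + X.T @ X
theta0 = np.linalg.solve(V0, X.T @ y_noisy)

# LinUCB score for a candidate arm x: exploitation term plus confidence width.
def ucb_score(x, theta, V_inv, alpha=1.0):
    return float(x @ theta + alpha * np.sqrt(x @ V_inv @ x))

x_arm = rng.normal(size=d)
score = ucb_score(x_arm, theta0, np.linalg.inv(V0))
```

After this warm start, the bandit would continue updating `V0` and `theta0` on real user feedback, so the synthetic prior matters most in early rounds.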


Paper Structure

This paper contains 47 sections, 1 theorem, 60 equations, 3 figures, 6 tables.

Key Result

Theorem 1

For any $\delta\in(0,1)$, with probability at least $1-\delta$, the warm-started ridge estimator satisfies, for all $t\ge 0$, a prior-centered confidence inequality whose width is controlled by the usual self-normalized variance term.
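The display equation is not reproduced in this extract. As a hedged sketch only, a prior-centered confidence inequality for a ridge estimator regularized toward a warm-start prior $\theta_0$ typically takes the following self-normalized form (in the style of OFUL-type analyses; the symbols $\sigma$, $\lambda$, and $V_t=\lambda I+\sum_{s\le t}x_s x_s^\top$ are assumptions, not necessarily the paper's notation):

```latex
\left\|\hat{\theta}_t - \theta^{*}\right\|_{V_t}
\;\le\;
\underbrace{\sigma\sqrt{2\log\!\left(\frac{\det(V_t)^{1/2}\,\det(\lambda I)^{-1/2}}{\delta}\right)}}_{\text{self-normalized variance term}}
\;+\;\sqrt{\lambda}\,\bigl\|\theta^{*}-\theta_{0}\bigr\|_{2}
```

In a bound of this shape, the prior error $\|\theta^{*}-\theta_{0}\|$ enters the confidence width additively, which is consistent with the abstract's decomposition of random label noise and systematic misalignment into the prior error driving regret.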

Figures (3)

  • Figure 1: Overview of the CBLI evaluation framework (Noisy-CBLI). An LLM generates synthetic preference data, which is optionally corrupted with random or label-flipping noise at rate $p$, and used to warm-start a LinUCB bandit that is then fine-tuned on real user data.
  • Figure 2: Cumulative regret on the COVID-19 Vaccine dataset under preference-flipping noise. “10k_fX” denotes a run in which X × 10% of the LLM-generated labels are flipped. Shaded regions are 95% confidence intervals over $G=10$ runs.
  • Figure 3: Cumulative regret on the COVID-19 Vaccine dataset under random-response noise.
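The comparisons these figures describe can be mimicked with a toy simulation: a cold-start LinUCB versus one warm-started on synthetic labels flipped at rate $p$, tracking cumulative regret. This is a hypothetical mini-simulation under assumed parameters, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative settings (assumptions): d features, T rounds, K arms per round.
d, T, K, lam = 5, 300, 10, 1.0
theta_star = rng.normal(size=d)

def run_linucb(V, b, alpha=1.0):
    """Run LinUCB for T rounds from initial statistics (V, b); return cumulative regret."""
    regret = 0.0
    for _ in range(T):
        arms = rng.normal(size=(K, d))
        V_inv = np.linalg.inv(V)
        theta = V_inv @ b
        widths = np.sqrt(np.einsum('kd,de,ke->k', arms, V_inv, arms))
        k = int(np.argmax(arms @ theta + alpha * widths))
        rewards = arms @ theta_star
        regret += float(rewards.max() - rewards[k])
        V += np.outer(arms[k], arms[k])
        b += rewards[k] * arms[k]
    return regret

def warm_stats(p, n=200):
    """Ridge statistics from n synthetic samples with flipping noise at rate p."""
    X = rng.normal(size=(n, d))
    y = X @ theta_star
    flip = rng.random(n) < p
    y = np.where(flip, -y, y)  # a flipped preference negates the signal
    return lam * np.eye(d) + X.T @ X, X.T @ y

cold = run_linucb(lam * np.eye(d), np.zeros(d))   # cold start: ridge prior only
warm_aligned = run_linucb(*warm_stats(p=0.0))     # clean synthetic prior
warm_noisy = run_linucb(*warm_stats(p=0.9))       # heavily corrupted prior
```

In runs of this sketch, the clean warm start tends to accumulate less regret than the cold start, while a heavily flipped prior tends to do worse, mirroring the qualitative pattern the figures report.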

Theorems & Definitions (1)

  • Theorem 1: Prior-centered confidence inequality