Should You Use Your Large Language Model to Explore or Exploit?

Keegan Harris, Aleksandrs Slivkins

TL;DR

The paper tackles how to leverage large language models for the explore-exploit tradeoff in decision-making, focusing on contextual bandits. It systematically evaluates multiple LLMs as exploitation and exploration oracles across MAB/CB puzzles and large action spaces, including QA and arXiv-based tasks, with various prompting strategies and mitigations. The findings show current LLMs struggle with exploitation on non-trivial tasks and often underperform simple linear baselines, while they can substantially aid exploration by proposing semantically meaningful candidate actions in high-dimensional spaces. This establishes a clear boundary for LLM utility in decision pipelines and points to future work on tool-enabled exploitation and smarter, semantically guided exploration.

Abstract

We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. We use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that while the current LLMs often struggle to exploit, in-context mitigations may be used to substantially improve performance for small-scale tasks. However, even then, LLMs perform worse than a simple linear regression. On the other hand, we find that LLMs do help at exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.
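To make the "simple linear regression" comparison concrete, here is a minimal sketch of what a linear exploitation baseline for a contextual bandit could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_per_arm` and `exploit`, the per-arm least-squares model, the small ridge term, and the synthetic noiseless data are all hypothetical choices made for this example.

```python
import numpy as np

def fit_per_arm(history, n_arms, dim, ridge=1e-6):
    """Fit one least-squares reward model per arm from (context, arm, reward) triples.

    Assumption for this sketch: rewards are roughly linear in the context for each arm.
    """
    weights = np.zeros((n_arms, dim))
    for arm in range(n_arms):
        rows = [(x, r) for x, a, r in history if a == arm]
        if not rows:
            continue  # an arm that was never played keeps a zero estimate
        X = np.array([x for x, _ in rows], dtype=float)
        y = np.array([r for _, r in rows], dtype=float)
        # A small ridge term keeps the solve well-posed if X is rank-deficient.
        weights[arm] = np.linalg.solve(X.T @ X + ridge * np.eye(dim), X.T @ y)
    return weights

def exploit(weights, context):
    """Return the arm with the highest predicted reward for this context."""
    return int(np.argmax(weights @ np.asarray(context, dtype=float)))

# Hypothetical synthetic task with d = K = 2 and noiseless linear rewards.
rng = np.random.default_rng(0)
true_theta = np.array([[1.0, 0.0], [0.0, 1.0]])
history = []
for _ in range(200):
    x = rng.uniform(0.0, 1.0, size=2)
    a = int(rng.integers(0, 2))
    history.append((x, a, float(true_theta[a] @ x)))

weights = fit_per_arm(history, n_arms=2, dim=2)
chosen_a = exploit(weights, [0.9, 0.1])
chosen_b = exploit(weights, [0.1, 0.9])
```

In this sketch the exploit step is purely greedy: it uses the history only to estimate rewards and never deliberately explores, which mirrors the idea of evaluating exploitation in isolation.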

Paper Structure

This paper contains 15 sections, 39 figures, 9 tables.

Figures (39)

  • Figure 1: MAB exploit puzzle for Gpt-4 (left), Gpt-4 with CoT (middle), and Gpt-3.5 with CoT (right), all with the "buttons" prompt. The following conventions apply to all figures in this section. Each line corresponds to a particular number of rounds $T$ and plots $\mathrm{FracCorrect}(\epsilon,T)$ against the empirical gap $\epsilon$ on the X-axis. The shaded band around each line represents a $95\%$ confidence interval. The dashed line is the number of tasks ("runs") with empirical gap $\leq\epsilon$; the corresponding Y-scale is on the right.
  • Figure 2: Gpt-4 succeeds on a small CB exploit puzzle (left), but fails on a slightly larger one (right).
  • Figure 3: CB exploit puzzle with $d=K=2$ and $T=4000$: mitigations help substantially. Gpt-4 without CoT (left) and Gpt-4 with CoT (right). Note that providing the full history with this $T$ vastly exceeds the context window for Gpt-4, Gpt-4o, and Gpt-3.5.
  • Figure 4: CB exploit puzzle with $d=K=5$ and $T=1000$: mitigations perform badly, but (mostly) much better than the no-mitigation baseline. Gpt-4o without CoT.
  • Figure 5: Left: Performance of Deepseek-R1 on our numerical CB puzzle. Right: Gpt-4o on the text-based CB exploit puzzle. Some mitigations help, but are outperformed by linear regression.
  • ...and 34 more figures