Can large language models explore in-context?

Akshay Krishnamurthy, Keegan Harris, Dylan J. Foster, Cyril Zhang, Aleksandrs Slivkins

TL;DR

This study probes whether contemporary LLMs can explore in-context, with the problem description and interaction history provided entirely in the prompt. GPT-3.5, GPT-4, and Llama2 are evaluated on simple multi-armed bandit tasks across diverse prompt designs, with UCB, TS, and Greedy as baselines; due to cost, the study relies on surrogate statistics to diagnose long-term exploration behavior. Almost all configurations fail to explore robustly, exhibiting one of two failure modes: suffix failures or uniform-like exploration. The single exception is a GPT-4 configuration that combines summarized history with reinforced chain-of-thought prompting and achieves TS-like performance. The results suggest that non-trivial interventions, such as external history summarization or targeted training and data curation, may be required to empower LLM-based decision making in more complex settings, and they point future work toward prompting strategies and learning paradigms for in-context agents.

Abstract

We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision making agents in complex settings.
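To make "externally summarized interaction history, presented as sufficient statistics" concrete, here is a minimal Python sketch that collapses a raw list of (arm, reward) pairs into per-arm play counts and average rewards. The function names and the summary wording are hypothetical illustrations; the exact prompt text used in the paper is not reproduced here.

```python
from typing import List, Optional, Tuple


def summarize_history(
    history: List[Tuple[int, float]], num_arms: int
) -> Tuple[List[int], List[Optional[float]]]:
    """Collapse a raw (arm, reward) history into per-arm counts and mean rewards.

    Arms are 0-indexed. An arm that was never played gets a mean of None.
    """
    counts = [0] * num_arms
    totals = [0.0] * num_arms
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward
    means = [
        totals[a] / counts[a] if counts[a] > 0 else None
        for a in range(num_arms)
    ]
    return counts, means


def format_summary(counts: List[int], means: List[Optional[float]]) -> str:
    """Render the statistics as text (hypothetical wording, not the paper's prompt)."""
    lines = []
    for arm, (n, mean) in enumerate(zip(counts, means)):
        if n == 0:
            lines.append(f"Arm {arm}: not yet played")
        else:
            lines.append(f"Arm {arm}: played {n} times, average reward {mean:.2f}")
    return "\n".join(lines)
```

The point of the summary is that the prompt length stays fixed at one line per arm regardless of how many rounds have been played, whereas an unsummarized history grows with every round.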

Paper Structure (14 sections, 19 figures)


Figures (19)

  • Figure 1: Representative experiments: Two prompt configurations for GPT-4 on a $5$-armed bandit problem, demonstrating exploration failure (top) and success (bottom). The baselines are two standard bandit algorithms with performance guarantees, Upper Confidence Bound (UCB) and Thompson Sampling (TS), as well as the Greedy algorithm, which always chooses an arm with the best average reward so far and is known to perform poorly. Visualizations are: (Left) histogram over replicates of the number of times the best arm is chosen, (Center) for each $t$, we plot the suffix failure frequency, the fraction of replicates for which the best arm is never chosen after time-step $t$, and (Right) cumulative time-averaged rewards, averaged over replicates. (a) Top row. GPT-4 with our basic prompt design with zero temperature. The experiment runs for $T=500$ rounds, and is replicated $N=20$ times, varying environment randomness. This configuration exhibits highly bimodal behavior: a large ($>60\%$) fraction of replicates choose the best arm only a handful of times and exhibit suffix failures, similar to Greedy, and very unlike UCB and TS. This is suggestive of a long-term failure to explore and, indeed, this configuration underperforms substantially in terms of reward. (b) Bottom row. GPT-4 with a suggestive framing, summarized history, and chain-of-thought with zero temperature. The experiment runs for $T=200$ rounds and is replicated $N=40$ times. This configuration exhibits a unimodal distribution of plays of the best arm, very few suffix failures, and reward that is comparable to TS.
  • Figure 2: Prompt designs; see Figure \ref{fig:prompts-text} for a more detailed view. A prompt is generated by traversing the graph from top to bottom.
  • Figure 3: Scatter plot summarizing all experiments with $T=100$. We plot suffix failures (expressed via $\texttt{SuffFailFreq}(T/2)$) vs. uniform-like failures (expressed via $K\cdot\texttt{MinFrac}(T)$). Each LLM/configuration pair maps to a dot on this plane (some dots may overlap). The GPT-4 configuration labeled with a star is BSS$\widetilde{\text{C}}$0, which is the only configuration that succeeds. We also plot $\epsilon$-Greedy, tracing out the different tradeoffs obtained for different values of $\epsilon$.
  • Figure 4: GPT-4 for $T=100$: a per-configuration summary table on the hard MAB instance. Only three GPT-4 configurations do not exhibit suffix failures; two of these (BNRND and BSSCD) exhibit uniform-like failures. The final configuration (BSS$\widetilde{\text{C}}$0) succeeds.
  • Figure 5: Detailed view of bimodal behavior and suffix failures for GPT-4 with $T=100$. Configurations visualized are the basic configuration (BNRN0) and the same configuration but with temperature $1$ (BNRN1). Visualizations are the same as in Figure \ref{fig:long}.
  • ...and 14 more figures
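
The suffix failure frequency described in the Figure 1 caption (the fraction of replicates for which the best arm is never chosen after time-step $t$) can be computed from logged arm choices. The Python sketch below is an illustration, not the authors' code. It assumes each replicate is a list of 0-indexed arm choices in which entry i is the arm played at round i+1, so "after time-step t" means the slice from index t onward. The function names are hypothetical.

```python
from typing import List, Sequence


def suffix_failure_frequency(
    replicates: Sequence[Sequence[int]], best_arm: int, t: int
) -> float:
    """Fraction of replicates in which best_arm is never chosen after round t.

    Assumption: choices[i] is the arm played at round i + 1, so the rounds
    strictly after t are choices[t:].
    """
    if not replicates:
        raise ValueError("need at least one replicate")
    failures = sum(1 for choices in replicates if best_arm not in choices[t:])
    return failures / len(replicates)


def best_arm_counts(
    replicates: Sequence[Sequence[int]], best_arm: int
) -> List[int]:
    """Per-replicate number of plays of the best arm (the histogram data)."""
    return [sum(1 for arm in choices if arm == best_arm) for choices in replicates]


if __name__ == "__main__":
    # Four toy replicates of length 4; arm 0 is taken to be the best arm.
    runs = [[0, 1, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1], [0, 1, 0, 0]]
    for t in range(3):
        print(t, suffix_failure_frequency(runs, best_arm=0, t=t))
    print(best_arm_counts(runs, best_arm=0))
```

Under this indexing, calling the function with t equal to the run length leaves an empty suffix, so every replicate counts as a failure; evaluating at an interior point such as $T/2$, as in Figure 3, avoids that degenerate case.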