Can large language models explore in-context?
Akshay Krishnamurthy, Keegan Harris, Dylan J. Foster, Cyril Zhang, Aleksandrs Slivkins
TL;DR
This study probes whether contemporary LLMs can explore in-context, with the problem description and interaction history provided entirely in the prompt. GPT-3.5, GPT-4, and Llama2 are evaluated on simple multi-armed bandit tasks across diverse prompt designs, with Upper Confidence Bound (UCB), Thompson Sampling (TS), and Greedy as baselines. Because of cost, the study relies on surrogate statistics to diagnose long-term exploration behavior. Almost all configurations fail to explore robustly, exhibiting one of two failure modes: suffix failures or uniform-like exploration. The single exception is a GPT-4 configuration that combines summarized history with reinforced chain-of-thought prompting and achieves TS-like performance. The results suggest that non-trivial interventions, such as external history summarization or targeted training and data curation, may be required to empower LLM-based decision making in more complex settings, and they point to future work on prompting strategies and learning paradigms for in-context agents.
Abstract
We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision making agents in complex settings.
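To make the setup concrete, the following is a minimal sketch of a Bernoulli multi-armed bandit with the three baselines named above (Greedy, UCB, and Thompson Sampling), each acting only on a summarized history of per-arm pull counts and reward totals. It is an illustration under stated assumptions, not the authors' code: the function names, the arm means, the horizon, the UCB bonus constant, and the uniform Beta(1, 1) prior are all hypothetical choices made here.

```python
import math
import random


def summarize(history, num_arms):
    """Collapse raw (arm, reward) pairs into per-arm pull counts and reward totals."""
    pulls = [0] * num_arms
    successes = [0] * num_arms
    for arm, reward in history:
        pulls[arm] += 1
        successes[arm] += reward
    return pulls, successes


def _argmax(scores):
    # Index of the largest score; ties go to the lowest index.
    return max(range(len(scores)), key=scores.__getitem__)


def _first_unpulled(pulls):
    # Return an arm that has never been pulled, or None if every arm has data.
    for arm, count in enumerate(pulls):
        if count == 0:
            return arm
    return None


def greedy(pulls, successes, rng):
    """Try each arm once, then always pick the best empirical mean."""
    arm = _first_unpulled(pulls)
    if arm is not None:
        return arm
    return _argmax([s / n for s, n in zip(successes, pulls)])


def ucb(pulls, successes, rng):
    """Empirical mean plus an exploration bonus (illustrative constant)."""
    arm = _first_unpulled(pulls)
    if arm is not None:
        return arm
    t = sum(pulls)
    return _argmax([
        s / n + math.sqrt(2.0 * math.log(t) / n)
        for s, n in zip(successes, pulls)
    ])


def thompson(pulls, successes, rng):
    """Sample each arm's mean from its Beta posterior (assumed Beta(1, 1) prior)."""
    return _argmax([
        rng.betavariate(1 + s, 1 + n - s)
        for s, n in zip(successes, pulls)
    ])


def run_bandit(policy, arm_means, horizon, seed=0):
    """Simulate one run; the policy sees only the summarized history each round."""
    rng = random.Random(seed)
    history = []
    for _ in range(horizon):
        pulls, successes = summarize(history, len(arm_means))
        arm = policy(pulls, successes, rng)
        reward = 1 if rng.random() < arm_means[arm] else 0
        history.append((arm, reward))
    return history


if __name__ == "__main__":
    arm_means = [0.4, 0.6, 0.4, 0.4, 0.4]  # hypothetical instance, one best arm
    for policy in (greedy, ucb, thompson):
        history = run_bandit(policy, arm_means, horizon=100)
        print(policy.__name__, sum(reward for _, reward in history))
```

In an LLM-agent version of this loop, the call to `policy` would be replaced by a prompt containing either the raw `history` or the output of `summarize`, which is the distinction between unsummarized and externally summarized history drawn in the abstract.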
