Can foundation models actively gather information in interactive environments to test hypotheses?

Danny P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, David P Reichert, Drew A. Hudson, John Reid, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Mozer, Jane X Wang

TL;DR

Foundation models struggle with multi-turn exploration in dynamic environments, prompting the need to study information gathering, meta-learning, and strategy adaptation. The authors evaluate several frontier models in two controlled settings: a text-based Feature World and a text-based Alchemy benchmark, using zero-shot prompting and inter-trial summarization to test emergent exploratory capabilities. Results show near-optimal information gathering in simple tasks, but robust meta-learning and strategy adaptation only emerge when models are prompted to summarize across trials, with notable cross-model differences (e.g., Gemini 2.5 Pro and Claude 3.7 outperforming ChatGPT-4o and o4-mini). The work demonstrates that the major challenge lies in integrating knowledge over time, not in moment-to-moment action selection, and argues that Alchemy is a valuable benchmark for advancing interactive, hypothesis-testing capabilities of foundation models.

Abstract

Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments, a requirement for many real-world challenges. We evaluated these models on their ability to learn from experience, adapt, and gather information. First, in "Feature World," a simple setting for testing information gathering, models performed near-optimally. However, to test more complex, multi-trial learning, we implemented a text-based version of the "Alchemy" environment, a benchmark for meta-learning. Here, agents must deduce a latent causal structure by integrating information across many trials. In this setting, recent foundation models initially failed to improve their performance over time. Crucially, we found that prompting the models to summarize their observations at regular intervals enabled an emergent meta-learning process. This allowed them to improve across trials and even adaptively re-learn when the environment's rules changed unexpectedly. While most models handled the simple task, Alchemy revealed stark differences in robustness: Gemini 2.5 performed best, followed by Claude 3.7, while ChatGPT-4o and o4-mini struggled. This underscores Alchemy's value as a benchmark. Our findings demonstrate that the biggest challenge for foundation models is not selecting informative actions in the moment, but integrating knowledge through adaptive strategies over time. Encouragingly, there appears to be no intrinsic barrier to future models mastering these abilities.
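The summarization prompting described above can be pictured as a loop that periodically condenses the episode history and feeds the result back into later prompts. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `run_episode`, `act`, `summarize`, `step_env`, and `summarize_every` are hypothetical, the model and environment are plain callables, and the choice to keep the raw history alongside the summary is an assumption.

```python
from typing import Callable, List


def run_episode(
    act: Callable[[str], str],        # hypothetical: model call, context -> action
    summarize: Callable[[str], str],  # hypothetical: model call, text -> summary
    step_env: Callable[[str], str],   # hypothetical: environment, action -> observation
    num_trials: int,
    steps_per_trial: int,
    summarize_every: int = 1,
) -> List[str]:
    """Run one multi-trial episode and return the summaries produced along the way."""
    summary = ""              # latest cross-trial summary (empty before the first one)
    history: List[str] = []   # raw action/observation log for the whole episode
    summaries: List[str] = []

    for trial in range(1, num_trials + 1):
        for _ in range(steps_per_trial):
            # Assumption: the model sees the current summary plus the raw history.
            context = summary + "\n" + "\n".join(history)
            action = act(context)
            observation = step_env(action)
            history.append(f"action={action} -> {observation}")

        # At a regular interval, ask for a fresh summary of what has been observed.
        if trial % summarize_every == 0:
            summary = summarize(summary + "\n" + "\n".join(history))
            summaries.append(summary)

    return summaries


# Stub callables stand in for a real model and environment.
demo = run_episode(
    act=lambda context: "noop",
    summarize=lambda text: f"{text.count('->')} steps seen",
    step_env=lambda action: "nothing happened",
    num_trials=4,
    steps_per_trial=3,
    summarize_every=2,
)
```

With real model calls in place of the stubs, only the `summarize_every` setting would need to change to compare a no-summarization condition (an interval larger than `num_trials`) against a summarization condition.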

Paper Structure

This paper contains 36 sections, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Task structures and experimental setups for Feature World. (a) Example task setup for text environment with single-feature reward function, with "blue" as the rewarding feature. (b) Example task setup for text environment with conjunction reward function, with "blue" and "cube" as the rewarding conjunction. (c) Schematic of text Feature World experiment setup.
  • Figure 2: Fraction of Feature World episodes in which models found a rewarding object before reaching the maximum number of exploration steps. (a) Single-feature reward function. (b) Conjunction reward function. Error bars represent standard error of the mean, with 200 episodes per condition for the models and 1000 for the random and optimal baselines.
  • Figure 3: Schematic and performance metrics for 3D exploration task, with 15 episodes per condition. (a) Mean number of exploration steps (objects placed on the conveyor) before sufficient information is available to determine the correct factor. (b) Accuracy of the model in determining the correct rewarding feature. Hatched blue bar represents accuracy if episodes with vision errors are removed. Error bars represent standard error of the mean.
  • Figure 4: Task structures and experimental setup for Alchemy. Upper left: The structure of an Alchemy experiment. Upper right: Example text observations of the initial state of two separate trials from the same episode, in which stones and potions are resampled but the effects of potions and the reward values of stones remain the same. Lower left: Example chemistries, represented as graphs determining the effects of potions (edges) on stones of different properties (nodes), that change between episodes. Lower right: Directed exploration setup in which an LLM receives feedback from the environment, information from a prompt, past history of the episode, and, optionally, a summary of the episode history. The two left panels are adapted, with permission, from figures in Wang et al. (2021).
  • Figure 5: Mean Alchemy episode scores for different models and conditions. (a) No summarization, no prior information. (b) No summarization, prior information. (c) Summarization, no prior information. (d) Summarization, prior information. N=10 replicates of 10-trial episodes. Error bars represent standard error of the mean. Asterisk indicates the mean is significantly different from that of the memoryless heuristic ($p<0.05$, paired-sample t-test).
  • ...and 9 more figures
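
The two Feature World reward functions named in the Figure 1 caption, and the error bars mentioned in the Figure 2 caption, can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the dictionary representation of objects, the feature keys `color` and `shape`, and the function names are hypothetical, and the binomial formula for the standard error is one common choice that the captions do not confirm.

```python
import math
from typing import Dict, Set

# Hypothetical object representation: feature name -> feature value.
Obj = Dict[str, str]


def is_rewarding(obj: Obj, rewarding: Set[str]) -> bool:
    """Return True when the object carries every rewarding feature value."""
    return rewarding.issubset(obj.values())


def sem_of_fraction(successes: int, episodes: int) -> float:
    """Standard error of a success fraction, assuming the binomial formula."""
    p = successes / episodes
    return math.sqrt(p * (1 - p) / episodes)


single_feature = {"blue"}         # single-feature reward: "blue"
conjunction = {"blue", "cube"}    # conjunction reward: "blue" and "cube"

blue_sphere = {"color": "blue", "shape": "sphere"}
blue_cube = {"color": "blue", "shape": "cube"}
```

Under the single-feature rule both example objects are rewarding, while under the conjunction rule only the blue cube is. The conjunction rule is the harder of the two because an agent has to rule out each feature on its own before settling on the pair.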