Can foundation models actively gather information in interactive environments to test hypotheses?
Danny P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, David P Reichert, Drew A. Hudson, John Reid, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Mozer, Jane X Wang
TL;DR
Foundation models struggle with multi-turn exploration in dynamic environments, which motivates studying their information gathering, meta-learning, and strategy adaptation. The authors evaluate several frontier models in two controlled settings, a text-based Feature World and a text-based version of the Alchemy benchmark, using zero-shot prompting and inter-trial summarization to test emergent exploratory capabilities. Results show near-optimal information gathering in the simple task, but robust meta-learning and strategy adaptation emerge only when models are prompted to summarize across trials, with notable cross-model differences (e.g., Gemini 2.5 Pro and Claude 3.7 outperform ChatGPT-4o and o4-mini). The work demonstrates that the major challenge lies in integrating knowledge over time, not in moment-to-moment action selection, and argues that Alchemy is a valuable benchmark for advancing the interactive, hypothesis-testing capabilities of foundation models.
Abstract
Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments, a requirement for many real-world challenges. We evaluated these models on their ability to learn from experience, adapt, and gather information. First, in "Feature World," a simple setting for testing information gathering, models performed near-optimally. However, to test more complex, multi-trial learning, we implemented a text-based version of the "Alchemy" environment, a benchmark for meta-learning. Here, agents must deduce a latent causal structure by integrating information across many trials. In this setting, recent foundation models initially failed to improve their performance over time. Crucially, we found that prompting the models to summarize their observations at regular intervals enabled an emergent meta-learning process. This allowed them to improve across trials and even adaptively re-learn when the environment's rules changed unexpectedly. While most models handled the simple task, Alchemy revealed stark differences in robustness: Gemini 2.5 performed best, followed by Claude 3.7, while ChatGPT-4o and o4-mini struggled. This underscores Alchemy's value as a benchmark. Our findings demonstrate that the biggest challenge for foundation models is not selecting informative actions in the moment, but integrating knowledge through adaptive strategies over time. Encouragingly, there appears to be no intrinsic barrier to future models mastering these abilities.
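The summarization intervention described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that models were prompted to summarize their observations at regular intervals, so the function name, the prompt wording, the `model` and `env` callables, and the `summarize_every` parameter are all hypothetical assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
ModelFn = Callable[[str], str]     # prompt text in, model reply out
EnvFn = Callable[[int, str], str]  # (trial index, action) -> observation text


def run_with_summaries(
    model: ModelFn,
    env: EnvFn,
    num_trials: int,
    steps_per_trial: int,
    summarize_every: int,
) -> Tuple[str, List[str]]:
    """Run trials, asking the model to summarize every `summarize_every` trials.

    Returns the final summary and the list of action prompts that were sent.
    """
    summary = ""              # knowledge carried across trials
    pending: List[str] = []   # observations gathered since the last summary
    action_prompts: List[str] = []

    for trial in range(num_trials):
        for _ in range(steps_per_trial):
            prompt = (
                f"Summary so far: {summary or '(none)'}\n"
                f"Recent observations: {pending}\n"
                "Choose the next action."
            )
            action_prompts.append(prompt)
            action = model(prompt)
            pending.append(env(trial, action))

        # At a regular interval, compress recent observations into the summary
        # so that later trials can build on earlier ones.
        if (trial + 1) % summarize_every == 0:
            summary = model(
                f"Previous summary: {summary or '(none)'}\n"
                f"New observations: {pending}\n"
                "Summarize what you have learned about the environment's rules."
            )
            pending = []

    return summary, action_prompts
```

The design point the sketch tries to capture is that the summary, rather than the raw interaction history, is what carries information from one block of trials to the next; how the paper actually formats prompts or chooses the interval is not specified in the abstract.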
