Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs
Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu
TL;DR
Auto-Bench formalizes scientific discovery as iterative causal-graph discovery, in which LLMs interact with an oracle to identify hidden adjacencies in directed (Chemistry) and undirected (Social Network) graphs. Each model runs autonomous cycles of hypothesis, intervention, observation, and refinement, and is evaluated with a reachability-based similarity measure and on long trajectories across color-change matrices. State-of-the-art LLMs perform well initially but degrade substantially as graph size and trajectory length increase, which exposes gaps in scalable, autonomous scientific reasoning. The work provides a benchmarking framework and insights to guide future architectural and prompting strategies for robust iterative reasoning in AI-driven discovery.
Abstract
Given the remarkable performance of Large Language Models (LLMs), an important question arises: can LLMs conduct human-like scientific research, discover new knowledge, and act as AI scientists? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark designed specifically for scientific discovery exists for LLM agents. To address this gap, we introduce a novel benchmark, Auto-Bench, that covers the aspects needed to evaluate LLMs for scientific discovery in both the natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of the underlying interactions, both chemical and social, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as problem complexity increases. This suggests an important gap between machine and human intelligence that future development of LLMs needs to take into account.
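To make the interactive loop concrete, the following is a minimal toy sketch of oracle-driven graph discovery. It is not the benchmark's implementation. The `Oracle` class, the `discover` and `reachability_similarity` functions, and the example graph are all hypothetical. The sketch assumes that the hidden graph is a directed acyclic graph, that intervening on a node reveals exactly the set of nodes it can reach, and that similarity is a Jaccard overlap of reachable pairs; the benchmark's actual oracle protocol and metric may differ.

```python
def reachable(adj, start):
    """Return every node reachable from `start` (excluding itself) via DFS."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)
    return seen


class Oracle:
    """Hypothetical oracle: holds the hidden graph and answers interventions."""

    def __init__(self, hidden_adj):
        self._adj = hidden_adj
        self.queries = 0

    def intervene(self, node):
        # Assumption: an intervention reveals all downstream (reachable) nodes.
        self.queries += 1
        return reachable(self._adj, node)


def discover(nodes, oracle):
    """Intervene on each node, then hypothesize a minimal consistent edge set."""
    # Intervention + observation: one query per node.
    desc = {v: oracle.intervene(v) for v in nodes}
    # Hypothesis + refinement: keep u -> v only if no other descendant w of u
    # already explains v (transitive reduction, valid for acyclic graphs).
    hypothesis = {}
    for u in nodes:
        hypothesis[u] = {
            v for v in desc[u]
            if not any(v in desc[w] for w in desc[u] if w != v)
        }
    return hypothesis


def reachability_similarity(nodes, adj_a, adj_b):
    """Assumed metric: Jaccard overlap of the (source, reachable) pairs."""
    pairs_a = {(u, v) for u in nodes for v in reachable(adj_a, u)}
    pairs_b = {(u, v) for u in nodes for v in reachable(adj_b, u)}
    union = pairs_a | pairs_b
    return 1.0 if not union else len(pairs_a & pairs_b) / len(union)


# Toy hidden graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
hidden = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
nodes = sorted(hidden)
oracle = Oracle(hidden)
hypothesis = discover(nodes, oracle)
score = reachability_similarity(nodes, hidden, hypothesis)
```

Under these assumptions, reachability observations alone cannot distinguish a direct edge from one already implied by a longer path, so the sketch recovers the hidden graph exactly only when it contains no such redundant edges, as in the toy example.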
