Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu

TL;DR

Auto-Bench formalizes scientific discovery as iterative causal-graph discovery, in which LLMs interact with an Oracle to identify hidden adjacencies in directed (Chemistry) and undirected (Social Network) graphs. The methodology uses autonomous cycles of hypothesis, intervention, observation, and refinement, and evaluates models with a reachability-based similarity measure and over long trajectories across color-change matrices. Key findings show that state-of-the-art LLMs perform well initially but degrade substantially as graph size and trajectory length increase, highlighting gaps in scalable, autonomous scientific reasoning. The work provides a benchmarking framework and insights to guide future architectural and prompting strategies for robust iterative reasoning in AI-driven discovery.
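
The cycle of intervention, observation, and refinement, together with a reachability-based comparison, can be illustrated with a small sketch. This is not the authors' implementation. The names `Oracle`, `discover`, `reachability`, and `reach_similarity` are hypothetical, and the sketch assumes that intervening on a node changes the state of that node and of every node reachable from it in a directed graph. A fixed intervene-on-every-node strategy stands in for the LLM, and the similarity score (fraction of matching off-diagonal reachability entries) is an assumed stand-in for the benchmark's metric.

```python
from itertools import product


def reachability(adj):
    """Transitive closure of a 0/1 adjacency matrix (Warshall's algorithm)."""
    n = len(adj)
    reach = [row[:] for row in adj]
    # product varies its first element slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if reach[i][k] and reach[k][j]:
            reach[i][j] = 1
    return reach


class Oracle:
    """Hypothetical oracle holding a hidden directed graph."""

    def __init__(self, hidden_adj):
        self.n = len(hidden_adj)
        self._reach = reachability(hidden_adj)

    def intervene(self, node):
        # Assumption: the intervened node and all of its descendants change state.
        return [1 if j == node or self._reach[node][j] else 0
                for j in range(self.n)]


def discover(oracle):
    """Stand-in for the LLM: intervene on each node once and record the effects."""
    n = oracle.n
    guess = [[0] * n for _ in range(n)]
    history = []
    for node in range(n):
        observation = oracle.intervene(node)
        history.append((node, observation))
        for j, changed in enumerate(observation):
            if changed and j != node:
                guess[node][j] = 1  # node influences j, directly or indirectly
    return guess, history


def reach_similarity(guess_adj, true_adj):
    """Assumed metric: share of off-diagonal reachability entries that agree (n >= 2)."""
    n = len(true_adj)
    rg, rt = reachability(guess_adj), reachability(true_adj)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(rg[i][j] == rt[i][j] for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    hidden = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # chain 0 -> 1 -> 2
    guess, history = discover(Oracle(hidden))
    print("guess:", guess)
    print("exact match:", guess == hidden)
    print("reachability similarity:", reach_similarity(guess, hidden))
```

Under these assumptions, interventions alone cannot separate a direct edge from an indirect path, so the recovered matrix can differ from the hidden one while producing identical observations. Comparing reachability rather than raw adjacency treats such graphs as equivalent.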

Abstract

Given the remarkable performance of Large Language Models (LLMs), an important question arises: can LLMs conduct human-like scientific research, discover new knowledge, and act as AI scientists? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. To address this gap, we introduce a novel benchmark, Auto-Bench, that encompasses the aspects necessary to evaluate LLMs for scientific discovery in both the natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of the underlying interactions, in both the chemistry and social network settings, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as problem complexity increases, which suggests an important gap between machine and human intelligence that future development of LLMs needs to take into consideration.

Paper Structure

This paper contains 17 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The framework of our Autonomous Cycle. (A) represents the complete benchmarking cycle. The LLMs are provided with task descriptions, the interventions they previously proposed, and the corresponding observations. Based on this information, the LLMs generate adjacency matrices and propose a new intervention to gather additional data. A new observation is then obtained and added to the input. (B) outlines the conditions for terminating or continuing the loop. The loop terminates when the generated adjacency matrix matches the underlying causal graph; otherwise, it continues. To simulate real-world scientific problems, we include two experimental settings: chemistry and social networks.
  • Figure 2: Illustration of the Chemistry setting. The brackets indicate (molecule index, molecule state). Figures (a) and (b) illustrate the change in state after an intervention on molecule 0. Figures (c) and (d) present a case where causal graph A and causal graph B result in the same observations.
  • Figure 3: Illustration of the Social Network setting. The brackets indicate (person index, person state). Figures (a) and (b) illustrate the change in state after an intervention on person 0.
  • Figure 4: Average Trajectory Accuracy vs. Trajectory