DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad Majumder; Harshit Surana; Dhruv Agarwal; Bhavana Dalvi Mishra; Abhijeetsingh Meena; Aryan Prakhar; Tirth Vora; Tushar Khot; Ashish Sabharwal; Peter Clark

TL;DR

DiscoveryBench formalizes data-driven discovery for evaluating LLMs and introduces a dual-component benchmark (DB-Real and DB-Synth) with 264 real tasks and 903 synthetic tasks across diverse domains. It defines a hypothesis semantic tree framework and an HMS metric to assess alignment between predicted and ground-truth hypotheses, revealing that current systems struggle, with peak performance around 25%. The work provides open datasets, baselines, and evaluation tooling to spur progress in autonomous hypothesis search and verification, highlighting the need for better contextual understanding and scalable reasoning. The benchmark thus serves as a catalyst for advancing robust, reproducible autonomous scientific discovery using large generative models.

Abstract

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations across task complexity. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
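As a rough illustration of the task definition in the abstract (a dataset, its metadata, and a natural-language discovery goal), the following minimal sketch shows one way such a task could be represented and flattened into text for an LLM agent. The class name `DiscoveryTask`, its field names, the helper `render_task`, and the example file, columns, and goal are all hypothetical; they are not taken from the benchmark's released format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveryTask:
    """Hypothetical container for one task: dataset(s), metadata, and a goal."""
    dataset_paths: tuple[str, ...]       # files the agent may analyze
    column_descriptions: dict[str, str]  # metadata: column name -> meaning
    goal: str                            # discovery goal in natural language


def render_task(task: DiscoveryTask) -> str:
    """Flatten a task into plain text that could be handed to an LLM agent."""
    lines = ["Datasets: " + ", ".join(task.dataset_paths), "Columns:"]
    for name, meaning in task.column_descriptions.items():
        lines.append(f"  - {name}: {meaning}")
    lines.append("Goal: " + task.goal)
    return "\n".join(lines)


# Made-up example values, used only to exercise the sketch.
example = DiscoveryTask(
    dataset_paths=("survey.csv",),
    column_descriptions={
        "edu_years": "years of schooling",
        "income": "annual income",
    },
    goal="How does education relate to income?",
)
print(render_task(example))
```

Keeping the metadata as a column-to-description mapping reflects the point made in the Figure 1 caption that solving a task involves mapping goal terms to column names; the actual metadata schema may differ.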

Paper Structure (30 sections, 1 equation, 9 figures, 1 table)

Introduction
Related Work
Formalization
DiscoveryBench
DB-Real: Collecting data-driven hypotheses in the wild
Features of DB-Real benchmark
DB-Synth: Generating data-driven hypotheses using LLMs
Evaluation
Experiments
Discovery Agents
Main Results
Analysis
Conclusion
FAQs
Limitations
...and 15 more sections

Figures (9)

Figure 1: Each DiscoveryBench task consists of a goal and dataset(s) (left). Solving the task requires both statistical analysis and scientific semantic reasoning, e.g., deciding which analysis is appropriate for the domain, and mapping goal terms to column names (center). A faceted evaluation allows open-ended final answers to be rigorously evaluated (right).
Figure 2: Hypothesis Semantic Tree
Figure 3: Workflow categories in DB-Real with representative examples.
Figure 4: (Left) Hypothesis Matching Scores ($\mathrm{HMS}$) across agent-LLM pairs in DB-Real and DB-Synth. (Right) Scatter plot of $\mathrm{ctx}_\mathrm{F1}$ against average $\mathrm{var}_\mathrm{F1} \times \mathrm{rel}_\mathrm{acc}$, showing that accurate contexts increase the probability of predicting variables and relations accurately. Scores are for the best model on DB-Real and include only the data points (44.2%) where both scores are non-zero.
Figure 5: Best non-oracle agent's performance ($\mathrm{HMS}$) (a) across domains, (b) for goal types (dimension to be discovered), and (c) for different workflow lengths. In (c) workflow length categories for DB-Real are s: $<10$, m: $>10, <20$, l: $>20$. For DB-Synth, it is the semantic tree height.
...and 4 more figures
