BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research

Zifeng Wang, Benjamin Danek, Jimeng Sun

TL;DR

BioDSA-1K introduces a large-scale, data-driven benchmark for biomedical hypothesis validation, derived from >300 publications and anchored in real-world datasets via cBioPortal. It assesses AI agents on hypothesis decision accuracy, evidence grounding, reasoning fidelity, and executable analysis code, including non-verifiable cases to reflect practice where data are inconclusive. Across multiple agent designs, reasoning augmentation improves performance, yet evidence alignment remains modest, highlighting the challenge of aligning automated analyses with human-derived conclusions. The benchmark aims to foster trustworthy, generalizable AI agents for biomedical discovery by providing a realistic, diverse evaluation of end-to-end data science workflows.

Abstract

Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.
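The quantitative axes above can be illustrated with a minimal sketch. It is not the benchmark's own scoring code: the label strings, the `TaskResult` record, and the `evaluate` function are hypothetical names introduced for illustration, and the exact definitions of the Type I and Type II error rates (wrongly accepting a false hypothesis, wrongly rejecting a true one) are assumptions rather than the paper's stated formulas. Evidence alignment and reasoning correctness need judgment against the source publication and are left out.

```python
from dataclasses import dataclass

# Hypothetical label set; the benchmark's actual label strings may differ.
TRUE, FALSE, NOT_VERIFIABLE = "True", "False", "Not Verifiable"


@dataclass
class TaskResult:
    """One agent run on one hypothesis task (illustrative record, not the benchmark schema)."""
    truth: str       # ground-truth label for the hypothesis
    predicted: str   # label the agent decided on
    code_ran: bool   # whether the agent's analysis code executed without error


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def evaluate(results):
    """Compute decision, non-verifiable-detection, and executability metrics."""
    n = len(results)
    truly_false = [r for r in results if r.truth == FALSE]
    truly_true = [r for r in results if r.truth == TRUE]
    truly_nv = [r for r in results if r.truth == NOT_VERIFIABLE]
    flagged_nv = [r for r in results if r.predicted == NOT_VERIFIABLE]
    return {
        "accuracy": ratio(sum(r.predicted == r.truth for r in results), n),
        # Assumed definition: a false hypothesis that the agent accepts as true.
        "type_i_error": ratio(sum(r.predicted == TRUE for r in truly_false), len(truly_false)),
        # Assumed definition: a true hypothesis that the agent rejects as false.
        "type_ii_error": ratio(sum(r.predicted == FALSE for r in truly_true), len(truly_true)),
        "nv_precision": ratio(sum(r.truth == NOT_VERIFIABLE for r in flagged_nv), len(flagged_nv)),
        "nv_recall": ratio(sum(r.predicted == NOT_VERIFIABLE for r in truly_nv), len(truly_nv)),
        "executability": ratio(sum(r.code_ran for r in results), n),
    }


# Made-up toy results, used only to show the call.
example = [
    TaskResult(TRUE, TRUE, True),
    TaskResult(TRUE, FALSE, True),
    TaskResult(FALSE, TRUE, False),
    TaskResult(FALSE, FALSE, True),
    TaskResult(NOT_VERIFIABLE, NOT_VERIFIABLE, True),
    TaskResult(NOT_VERIFIABLE, TRUE, True),
]
metrics = evaluate(example)
print(metrics)
```

Returning `None` for an empty denominator keeps an undefined rate (for example, precision when the agent never answers "Not Verifiable") distinct from a genuine score of zero.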

Paper Structure

This paper contains 42 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Benchmark statistics. (Left) BioDSA-1K includes diverse types of biomedical research and data analysis tasks created from 329 publications; the x-axis indicates the publication types. (Right) Bubble plot illustrating the diverse range of biomedical data tables in BioDSA-1K, showing each data table's number of rows (x-axis, log scale) versus number of columns (y-axis, log scale).
  • Figure 1: Examples of the hypothesis, counter-hypothesis, and supporting evidence extracted from biomedical publications.
  • Figure 2: Overview of BioDSA-1K. a, Benchmark curation: Scientific publications linked to biomedical datasets are parsed to extract hypotheses and their corresponding supporting evidence, forming the core reasoning challenges. b, Experiments: AI agents are tasked with validating hypotheses by planning analysis steps, generating executable code, observing results, and making decisions based on structured biomedical datasets. c, Evaluation metrics: Agent performance is evaluated based on hypothesis decision accuracy (Type I and Type II errors), evidence alignment with publication findings, non-verifiable hypothesis detection (precision and recall), and code executability rate.
  • Figure 3: Comparison of Type I and Type II error rates across publication types and agent variants. Each point denotes an agent's performance on a specific publication type.
  • Figure 4: Code executability analysis and the breakdown of error types in non-executable code across the selected AI agents.
  • ...and 4 more figures