BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research
Zifeng Wang, Benjamin Danek, Jimeng Sun
TL;DR
BioDSA-1K is a large-scale, data-driven benchmark for biomedical hypothesis validation, derived from more than 300 publications and anchored in real-world datasets via cBioPortal. It assesses AI agents on hypothesis decision accuracy, evidence grounding, reasoning fidelity, and the executability of their analysis code, and it includes non-verifiable cases to reflect situations in practice where the data are inconclusive. Across multiple agent designs, reasoning augmentation improves performance, yet evidence alignment remains modest, which highlights how hard it is to align automated analyses with human-derived conclusions. The benchmark aims to foster trustworthy, generalizable AI agents for biomedical discovery by providing a realistic, diverse evaluation of end-to-end data science workflows.
Abstract
Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.
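To make the task structure concrete, the following is a minimal sketch of how a hypothesis-centric task and the first evaluation axis (hypothesis decision accuracy) might be represented. It is not the benchmark's actual schema or scoring code: the class name `HypothesisTask`, its fields, the three label strings, and the function `decision_accuracy` are all assumptions introduced for illustration, and the example hypotheses are made-up placeholders rather than entries from BioDSA-1K.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical label set (an assumption): a hypothesis can be supported by the
# data, refuted by it, or non-verifiable when the data are insufficient.
LABELS = ("supported", "refuted", "non_verifiable")


@dataclass
class HypothesisTask:
    """Illustrative task record; field names are assumptions, not the real schema."""
    task_id: str
    hypothesis: str                       # affirmative statement of the claim
    label: str                            # ground-truth decision, one of LABELS
    evidence: List[str] = field(default_factory=list)  # supporting evidence notes

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def decision_accuracy(tasks: List[HypothesisTask],
                      predictions: Dict[str, str]) -> float:
    """Fraction of tasks whose predicted decision matches the ground truth.

    A task with no prediction counts as incorrect.
    """
    if not tasks:
        raise ValueError("no tasks to score")
    correct = sum(1 for t in tasks if predictions.get(t.task_id) == t.label)
    return correct / len(tasks)


if __name__ == "__main__":
    # Placeholder tasks for illustration only; not drawn from the benchmark.
    tasks = [
        HypothesisTask("t1", "Placeholder claim A holds in cohort X.", "supported"),
        HypothesisTask("t2", "Placeholder claim B holds in cohort X.", "refuted"),
        HypothesisTask("t3", "Placeholder claim C holds in cohort X.", "non_verifiable"),
    ]
    agent_predictions = {"t1": "supported", "t2": "supported", "t3": "non_verifiable"}
    print(f"decision accuracy: {decision_accuracy(tasks, agent_predictions):.2f}")
```

Treating "non_verifiable" as a label in its own right means an agent is only credited on such tasks when it recognizes that the data cannot settle the claim. The other three axes (evidence alignment, reasoning correctness, and code executability) would need their own scoring and are not covered by this sketch.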
