BABE: Biology Arena BEnchmark

Junting Zhou; Jin Chen; Linfeng Hao; Denghui Cao; Zheyu Wang; Qiguang Chen; Chaoyou Fu; Jiaze Chen; Yuchen Wu; Ge Zhang; Mingxuan Wang; Wenhao Huang; Tong Yang

TL;DR

BABE addresses a gap in biology AI benchmarks by focusing on experimental reasoning that integrates results with context. It frames evaluation around a single source document $D$ using a triplet of questions $Q_{BABE}$ with correlation relations to diagnose sequential versus parallel reasoning. The data pipeline blends frontier literature curation, expert item development, and multi-stage quality assurance to produce high-quality, research-derived tasks across 12 biology subfields. Findings show that deeper, sustained reasoning and multi-trial inference improve performance, highlighting the benchmark's potential to guide the development of biologically grounded, discovery-oriented AI.
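
The item structure described above can be sketched as a small data model. This is a minimal illustration, not the authors' released format: the names `BabeItem`, `Correlation`, and `accuracy_by_correlation` are hypothetical, and the reading that strong correlation corresponds to sequential reasoning and weak correlation to parallel reasoning is an assumption drawn from the summary rather than a confirmed definition.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Correlation(Enum):
    # Assumed reading: in a strong-correlation triplet, later questions build
    # on earlier answers (sequential); in a weak-correlation triplet, the
    # questions can be answered independently (parallel).
    STRONG = "strong"
    WEAK = "weak"


@dataclass(frozen=True)
class BabeItem:
    """Hypothetical container for one benchmark item."""
    document: str                    # the single source document D
    questions: Tuple[str, str, str]  # the question triplet Q_BABE
    correlation: Correlation         # relation among the three questions


def accuracy_by_correlation(
    items: List[BabeItem],
    graded: List[Tuple[bool, bool, bool]],
) -> Dict[Correlation, float]:
    """Per-question accuracy, reported separately for each correlation type."""
    if len(items) != len(graded):
        raise ValueError("items and graded must have the same length")
    hits: Dict[Correlation, int] = defaultdict(int)
    totals: Dict[Correlation, int] = defaultdict(int)
    for item, marks in zip(items, graded):
        hits[item.correlation] += sum(marks)   # True counts as 1
        totals[item.correlation] += len(marks)
    return {c: hits[c] / totals[c] for c in totals}
```

Reporting accuracy per correlation type, rather than as one pooled number, is one way to separate failures in chained reasoning from failures on independent questions.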

Abstract

The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is uniquely constructed from peer-reviewed research papers and real-world biological studies, ensuring that tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. BABE challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.

Paper Structure (18 sections, 6 equations, 11 figures, 3 tables)

Introduction
Related Work
Deep Research Agents
Scientific Benchmarks
Biology-Specific Benchmarks
Approach
Problem Formulation
Data Collection
Experiments
Overall Performance Analysis
Strong vs. Weak Correlation
Reasoning Behavior Analysis on BABE
BABE requires deeper reasoning.
Excessive self-reflection on BABE can lead to a substantial degradation in reasoning performance.
Strong performance on BABE depends on sustained, evenly applied deep reasoning.
...and 3 more sections

Figures (11)

Figure 1: Overview of the BABE (Biology Arena BEnchmark) construction and composition. (A) The multi-stage annotation pipeline for constructing the BABE benchmark. (B) The disciplinary distribution of questions in the BABE benchmark, covering 12 subfields of biology. (C) The proportion of strong-correlation (45%) and weak-correlation (55%) questions in the final BABE benchmark.
Figure 2: The Reasoning Behavior Distribution on BABE across four LLMs.
Figure 3: Performance gain from multi-trial inference.
Figure 4: Example Question 1 of BABE
Figure 5: Example Question 2 of BABE
...and 6 more figures
