BABE: Biology Arena BEnchmark

Junting Zhou, Jin Chen, Linfeng Hao, Denghui Cao, Zheyu Wang, Qiguang Chen, Chaoyou Fu, Jiaze Chen, Yuchen Wu, Ge Zhang, Mingxuan Wang, Wenhao Huang, Tong Yang

TL;DR

BABE addresses a gap in biology AI benchmarks by focusing on experimental reasoning that integrates results with context. It frames evaluation around a single source document $D$ using a triplet of questions $Q_{BABE}$ with correlation relations to diagnose sequential versus parallel reasoning. The data pipeline blends frontier literature curation, expert item development, and multi-stage quality assurance to produce high-quality, research-derived tasks across 12 biology subfields. Findings show that deeper, sustained reasoning and multi-trial inference improve performance, highlighting the benchmark's potential to guide the development of biologically grounded, discovery-oriented AI.

Abstract

The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is constructed from peer-reviewed research papers and real-world biological studies, ensuring that tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. BABE challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.

Paper Structure (18 sections, 6 equations, 11 figures, 3 tables)

This paper contains 18 sections, 6 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of the BABE (Biology Arena BEnchmark) construction and composition. (A) The multi-stage annotation pipeline for constructing the BABE benchmark. (B) The disciplinary distribution of questions in the BABE benchmark, covering 12 subfields of biology. (C) The proportion of strong-correlation (45%) and weak-correlation (55%) questions in the final BABE benchmark.
  • Figure 2: The Reasoning Behavior Distribution on BABE across four LLMs.
  • Figure 3: Performance gain from multi-trial inference.
  • Figure 4: Example Question 1 of BABE.
  • Figure 5: Example Question 2 of BABE.
  • ...and 6 more figures