BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, Samuel G Rodriques

TL;DR

BixBench introduces a real-world, open-ended benchmark for evaluating LLM-based agents in bioinformatics, featuring 61 analytical capsules and 205 open-answer questions to assess long, multi-step data analyses. The authors provide an Aviary-based, Docker-reproducible evaluation framework and demonstrate that current frontier models (GPT-4o, Claude 3.5 Sonnet) perform poorly in open-ended tasks (≈21% accuracy) and near random in MCQ without abstention, underscoring the gap to autonomous bioinformaticians. The work spans capsule construction, expert curation, MCQ generation, and rigorous evaluation, establishing a resource to drive development of robust, autonomous biological data analysis agents. Overall, BixBench highlights critical limitations in present LLM capabilities for rigorous bioinformatics research and offers a structured path toward advancing autonomous scientific discovery.

Abstract

Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge tasks, towards more practical work such as literature review and experimental planning. Bioinformatics is a domain where fully autonomous AI-driven discovery may be near, but no extensive benchmarks for measuring progress have been introduced to date. We therefore present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis with nearly 300 associated open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses. We evaluate the performance of two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) using a custom agent framework we open source. We find that even the latest frontier models only achieve 17% accuracy in the open-answer regime, and no better than random in a multiple-choice setting. By exposing the current limitations of frontier models, we hope BixBench can spur the development of agents capable of conducting rigorous bioinformatic analysis and accelerate scientific discovery.

Paper Structure

This paper contains 24 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: BixBench benchmark creation diagram. (A) To create the initial seed capsules, expert bioinformaticians assembled a code notebook and provided input data and metadata, including a hypothesis, a result, etc. Seed capsules were reviewed by other experts before being merged into the final corpus. (B) To generate benchmark tasks, we first asked an LLM to propose candidate questions for each capsule. These questions were reviewed by multiple experts, yielding the final dataset.
  • Figure 2: Histogram of capsules in a broad set of self-selected analytical categories.
  • Figure 3: BixBench evaluation. An agent is provided a task capsule, consisting of data and associated questions. The agent environment contains an empty code notebook in a Docker environment pre-loaded with many popular bioinformatics packages, with three tools to manipulate the environment. In the open-response regime (A), the agent's submitted answer is compared directly to the ground-truth answer by a separate LLM call. In the MCQ regime (B), a second LLM is given the agent notebook and answer, and asked to choose from the available options.
  • Figure 4: Overall model performance. Models perform poorly in the purely open-answer regime (left bars), better in the MCQ regime with a refusal option, and better still on MCQs without a refusal option. Performance does not surpass a baseline in which the model answers the questions without access to any analysis notebook (i.e., pure model recall).
  • Figure 5: Performance of models in the MCQ regime relative to the number of votes, with ablations of refusal (top plot) and image generation (bottom plot). Note that in the vision ablation the refusal option is present and the agents are prompted not to produce images or plots. In the refusal ablation, agents are free to use images and plots.
  • ...and 3 more figures
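
The captions for Figures 4 and 5 describe MCQ accuracy with and without a refusal option, reported against the number of votes. The sketch below shows one way such a score could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: it assumes that votes are aggregated by simple majority over repeated runs, that a refusal winning the vote counts as incorrect, and that ties go to the answer seen first. The names `REFUSAL`, `majority_vote`, and `accuracy_at_k` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

# Hypothetical label for the refusal option; the benchmark's actual wording may differ.
REFUSAL = "Insufficient information"


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most frequent answer, or None if there are no answers.

    Ties are broken in favour of the answer that appeared first
    (an assumption made for this sketch).
    """
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def accuracy_at_k(
    runs: Sequence[Sequence[str]], truths: Sequence[str], k: int
) -> float:
    """Fraction of questions whose majority answer over the first k runs is correct.

    runs[i] holds the answers chosen on question i across repeated attempts.
    A refusal that wins the vote never matches the ground truth, so it is
    scored as incorrect.
    """
    if len(runs) != len(truths) or not truths:
        raise ValueError("runs and truths must be non-empty and the same length")
    correct = sum(
        majority_vote(answers[:k]) == truth
        for answers, truth in zip(runs, truths)
    )
    return correct / len(truths)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    runs = [
        ["A", "A", "B"],
        ["C", REFUSAL, REFUSAL],
        ["B", "D", "D"],
    ]
    truths = ["A", "C", "D"]
    for k in (1, 3):
        print(k, accuracy_at_k(runs, truths, k))
```

In the toy data, the second question is answered correctly on the first run but refused on the next two, so adding votes turns a correct answer into a refusal. This is the kind of interaction between voting and the refusal option that a refusal ablation is meant to expose.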