scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis

Kenny Workman, Zhen Yang, Harihara Muralidharan, Aidan Abdulali, Hannah Le

TL;DR

scBench provides a deterministic, data-grounded benchmark to evaluate AI agents on real-world scRNA-seq analysis tasks across six platforms and seven categories. By pairing data snapshots with natural-language prompts and deterministic graders, it reveals substantial, platform-dependent variation in agent performance, with a top accuracy of 52.8% and a 23.6-point model spread. The benchmark highlights that normalization and QC are relatively reliable, while differential expression and cell typing remain judgment-heavy and challenging, especially on less-documented platforms. Together with SpatialBench, scBench establishes a measurement and diagnostic framework to guide test-driven development of scRNA-seq analysis agents and to inform the design of platform-aware tooling for faithful, reproducible biological inference.

Abstract

As single-cell RNA sequencing datasets grow in adoption, scale, and complexity, data analysis remains a bottleneck for many research groups. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world single-cell datasets. We introduce scBench, a benchmark of 394 verifiable problems derived from practical scRNA-seq workflows spanning six sequencing platforms and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Benchmark data on eight frontier models show that accuracy ranges from 29% to 53%, with strong model-task and model-platform interactions. Platform choice affects accuracy as much as model choice, with drops of more than 40 percentage points on less-documented technologies. scBench complements SpatialBench to cover the two dominant single-cell modalities, serving both as a measurement tool and a diagnostic lens for developing agents that can analyze real scRNA-seq datasets faithfully and reproducibly.

Paper Structure (43 sections, 1 equation, 6 figures, 6 tables)

This paper contains 43 sections, 1 equation, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Distribution of 394 evaluations across platforms and task categories. Cell typing and differential expression dominate; ParseBio lacks QC evaluations.
  • Figure 2: Aggregate accuracy of 8 frontier models on scBench (394 evaluations, 3 replicates each). Error bars show 95% confidence intervals computed via two-stage aggregation with the $t$-distribution.
  • Figure 3: Accuracy versus cost (left) and latency (right). Dashed lines connect Pareto-optimal models. GPT-5.2 achieves near-top accuracy at lower cost; Opus 4.6 leads accuracy but incurs higher cost and latency.
  • Figure 4: Accuracy (%) by model and task category. Tasks ordered by difficulty (normalization easiest, differential expression hardest). Error bars show 95% confidence intervals. The difficulty gradient is consistent across models.
  • Figure 5: Accuracy (%) by sequencing platform. Platforms ordered by decreasing cross-model mean accuracy. Error bars show 95% confidence intervals.
  • ...and 1 more figure
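
The Figure 2 caption says the 95% confidence intervals come from two-stage aggregation with the $t$-distribution, but it does not spell out the procedure. The Python sketch below shows one plausible reading, stated here as an assumption: first average the replicates within each evaluation, then treat the per-evaluation means as the sample and build a $t$-interval around their mean. The function names (`t_pdf`, `t_critical`, `two_stage_ci`), the input layout, and the toy outcomes are hypothetical and are not the authors' code or data. The critical value is computed numerically so the example needs only the standard library.

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_critical(df: int, confidence: float = 0.95, steps: int = 2000) -> float:
    """Two-sided critical value, found by bisection on a Simpson-rule integral."""
    target = confidence / 2  # probability mass between 0 and the critical value

    def mass(x: float) -> float:
        # Composite Simpson's rule on [0, x]; `steps` must be even.
        h = x / steps
        total = t_pdf(0.0, df) + t_pdf(x, df)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * t_pdf(i * h, df)
        return total * h / 3

    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if mass(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def two_stage_ci(outcomes: dict, confidence: float = 0.95):
    """Return (mean, lower, upper) for pass/fail replicate outcomes.

    `outcomes` maps an evaluation id to a list of 0/1 replicate results
    (assumed layout). Stage 1 averages replicates per evaluation; stage 2
    averages those means and attaches a t-interval with n - 1 degrees of freedom.
    """
    per_eval = [sum(reps) / len(reps) for reps in outcomes.values()]  # stage 1
    n = len(per_eval)
    center = mean(per_eval)                                           # stage 2
    half_width = t_critical(n - 1, confidence) * stdev(per_eval) / math.sqrt(n)
    return center, center - half_width, center + half_width


# Toy data: four evaluations with three replicates each (made up for illustration).
toy = {"eval_a": [1, 1, 1], "eval_b": [1, 0, 1],
       "eval_c": [0, 0, 0], "eval_d": [0, 1, 0]}
print(two_stage_ci(toy))
```

Averaging within each evaluation first means that replicates of the same problem are not counted as independent samples, so the interval width is driven by the number of evaluations rather than by the total number of runs. With only four toy evaluations the interval is very wide and can extend outside [0, 1]; this sketch does not clip it.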