SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

Jiahao Zhao; Feng Jiang; Shaowei Qin; Zhonghui Zhang; Junhao Liu; Guibing Guo; Hamid Alinejad-Rokny; Min Yang

TL;DR

This work presents SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models, which formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions.

Abstract

Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce knowledge-augmented evaluation, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that (i) under the unified virtual cell evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. SC-ARENA thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.
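The contrast the abstract draws between string matching and ontology-aware judgment can be illustrated with a minimal sketch. This is not SC-ARENA's actual metric: the hierarchy `TOY_PARENT`, the function names, and the `1 / (1 + distance)` scoring rule are all assumptions chosen for illustration, and the toy hierarchy is not the real Cell Ontology.

```python
# Hypothetical child -> parent "is-a" hierarchy; NOT the real Cell Ontology.
TOY_PARENT = {
    "cell": None,
    "leukocyte": "cell",
    "lymphocyte": "leukocyte",
    "T cell": "lymphocyte",
    "CD4-positive T cell": "T cell",
    "B cell": "lymphocyte",
    "neuron": "cell",
}


def path_to_root(term):
    """Return the list of terms from `term` up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = TOY_PARENT[term]
    return path


def ontology_distance(a, b):
    """Count edges between a and b through their lowest common ancestor."""
    path_b = path_to_root(b)
    depth_in_b = {t: i for i, t in enumerate(path_b)}
    for steps_from_a, t in enumerate(path_to_root(a)):
        if t in depth_in_b:
            return steps_from_a + depth_in_b[t]
    raise ValueError("terms share no common ancestor")


def exact_match_score(pred, gold):
    """Brittle baseline: full credit only for an identical string."""
    return 1.0 if pred == gold else 0.0


def ontology_score(pred, gold):
    """Assumed scoring rule: partial credit that decays with distance."""
    return 1.0 / (1.0 + ontology_distance(pred, gold))
```

Under these assumptions, predicting "CD4-positive T cell" when the reference label is "T cell" receives no credit from exact matching but partial credit from the ontology-based score, while a prediction of "neuron" is penalized more heavily because it sits further away in the hierarchy.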

Paper Structure (51 sections, 4 equations, 14 figures, 11 tables)

Introduction
Related Work
Single-Cell Modeling Approaches
Single-Cell Benchmarks
The SC-ARENA Evaluation Framework
Knowledge Cell Class: Defining the Participant as a Virtual Cell
Attributes
Methods
Multi-task Benchmark with Formal Examination
Knowledge-Augmented Evaluation
Experiments
Benchmark Dataset Construction
External Knowledge for Evaluation
Experiment Setup
Benchmarking Results
...and 36 more sections

Figures (14)

Figure 1: Overview of the SC-ARENA framework.
Figure 2: (a) Relationship between prediction score and ontology distance (Spearman $\rho = 0.6212$, $p < 0.001$); (b) Example scoring responses using external knowledge.
Figure 3: Radar-plot comparison of representative general-purpose models across the five SC-ARENA tasks: cell type annotation, perturbation prediction, cell generation, cell captioning, and scientific QA. The visualization highlights the uneven distribution of gains: while models such as Kimi-K2 and DeepSeek-R1 excel in captioning and generation, Qwen3-32B performs comparatively better in perturbation prediction. The radar plot provides a task-level perspective that complements aggregate scores and illustrates persistent challenges in mechanistic reasoning.
Figure 4: Distribution of ontology path length to root for predicted cell types across models. The x-axis shows binned intervals of the average path length to the ontology root, using left-closed, right-open notation [a,b), and the y-axis indicates the number of predicted cell types falling into each interval. Shorter path lengths indicate closer alignment with the ontology hierarchy and thus more specific predictions.
Figure 5: Cell Type Annotation Answer Generation Prompt.
...and 9 more figures
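The left-closed, right-open binning described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' plotting code: the function name `bin_left_closed` and the example edges are hypothetical, and values outside the outermost edges are simply ignored.

```python
from bisect import bisect_right


def bin_left_closed(values, edges):
    """Count values per interval [edges[i], edges[i+1]).

    `edges` must be sorted in ascending order. A value equal to an
    interior edge falls into the interval that starts at that edge;
    values below the first edge or at/above the last edge are skipped.
    """
    labels = [f"[{lo},{hi})" for lo, hi in zip(edges, edges[1:])]
    counts = {label: 0 for label in labels}
    for v in values:
        # bisect_right places v after any equal edge, so v == a lands in [a, b).
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(labels):
            counts[labels[i]] += 1
    return counts
```

For example, with edges `[0, 2, 4, 6]` an average path length of exactly 2 is counted in `[2,4)` rather than `[0,2)`, which is the behavior the `[a,b)` notation implies.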
