Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data

Junhao Liu, Siwei Xu, Lei Zhang, Jing Zhang

TL;DR

The paper addresses automated cell type annotation in single-cell omics by benchmarking instruction-tuned LLMs on SOAR, a two-part benchmark comprising SOAR-RNA for scRNA-seq data and SOAR-MultiOmics for multiomics data. It introduces two zero-shot prompting strategies, one of which uses chain-of-thought (CoT), and a cross-modality alignment module that translates ATAC-seq to RNA-seq so that LLMs can reason over it. Key findings show that open-source LLMs with CoT prompting can match or exceed domain-specific baselines without fine-tuning, and that cross-modality translation enables effective multiomics annotation. The work highlights the potential of LLMs to automate complex genomics analyses and provides a foundation for broader omics benchmarks and retrieval-augmented strategies in cell type labeling.

Abstract

Over the past decade, the revolution in single-cell sequencing has enabled the simultaneous molecular profiling of various modalities across thousands of individual cells, allowing scientists to investigate the diverse functions of complex tissues and uncover underlying disease mechanisms. Among all the analytical steps, assigning individual cells to specific types is fundamental for understanding cellular heterogeneity. However, this process is usually labor-intensive and requires extensive expert knowledge. Recent advances in large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract essential biological knowledge, such as marker genes, potentially promoting more efficient and automated cell type annotations. To thoroughly evaluate the capability of modern instruction-tuned LLMs in automating the cell type identification process, we introduce SOAR, a comprehensive benchmarking study of LLMs for cell type annotation tasks in single-cell genomics. Specifically, we assess the performance of 8 instruction-tuned LLMs across 11 datasets, spanning multiple cell types and species. Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data, while extending their application to multiomics data through cross-modality translation. Additionally, we evaluate the effectiveness of chain-of-thought (CoT) prompting techniques in generating detailed biological insights during the annotation process. The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning, advancing the automation of cell type annotation in genomics research.

Paper Structure

This paper contains 33 sections, 12 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: The illustration of prompting LLMs to finish the cell type annotation task.
  • Figure 2: The EM and F1 evaluation results on the SOAR-RNA benchmark using two zero-shot prompting strategies.
  • Figure 3: The BLEU evaluation result per tissue of the SOAR-RNA benchmark.
  • Figure 4: The F1 evaluation results on the SOAR-MultiOmics benchmark.
  • Figure 5: The statistics results of percentages of tissue in the SOAR-RNA dataset.
  • ...and 5 more figures
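
Figure 2 reports EM and F1 scores for predicted cell type labels. The paper's exact scoring procedure is not given in this excerpt, so the following is only a minimal sketch that assumes EM means exact match after light normalization and F1 means token-overlap F1 between the predicted and reference label strings, as is common in question-answering evaluation. The function names `normalize`, `exact_match`, and `token_f1` are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def normalize(label: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens.

    Assumption: this normalization is illustrative, not the paper's.
    """
    return re.sub(r"[^a-z0-9\s]", " ", label.lower()).split()


def exact_match(pred: str, gold: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a reference label."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        # If either side is empty, score 1.0 only when both are empty.
        return float(p == g)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction such as "CD4+ T cell" against the reference "T cell" fails exact match but still earns partial F1 credit, which is one reason a benchmark might report both metrics side by side.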