scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis

Ping Xu, Zaitian Wang, Zhirui Wang, Pengjiang Li, Ran Zhang, Gaoyang Li, Hanyu Xie, Jiajia Wang, Yuanchun Zhou, Pengfei Wang

TL;DR

The paper tackles reproducibility and benchmarking challenges in single-cell RNA sequencing by introducing scUnified, an AI-ready, standardized resource that compiles 13 high-quality datasets across two species (human and mouse) and nine tissues. All datasets undergo uniform quality control and preprocessing and are stored in the AnnData-compatible .h5ad format, so they can be applied directly to clustering, cell-type annotation, and marker gene identification without additional cleaning. The authors demonstrate the resource's utility through cross-method evaluations (Leiden, scMAE, scCDCG) and a knowledge-informed foundation model (GeneCompass), plus case studies on representation learning and marker-based annotation, illustrating cross-dataset validity. The resource lowers technical barriers, supports fair benchmarking, and accelerates AI-driven single-cell research; the authors plan to expand it to more species, tissues, and multi-omics data.

Abstract

Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.

Paper Structure

This paper contains 19 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of scUnified: standardized single-cell RNA sequencing datasets across species and tissues, supporting AI-driven biological research and discovery.
  • Figure 2: Distribution of sample numbers in the human and mouse data.
  • Figure 3: Dataset distributions by cell count, gene number, clusters, and sparsity.
  • Figure 4: Case study of scCDCG on Muris Limb Muscle, demonstrating integrated representation learning, two-dimensional visualization, and biologically guided cell-type annotation. (e) presents four columns from left to right: Gold-standard labels, results of the Best-mapping annotation, results of the Marker-overlap annotation, and the Gold-standard labels.
  • Figure 5: Case study on the Muris Limb Muscle, summarizing the results of Leiden, scMAE, and scCDCG models, including representation learning, two-dimensional visualization, and marker gene-based cell type annotation. (m)-(o) present four columns from left to right: Gold-standard labels, results of the Best-mapping annotation, results of the Marker-overlap annotation, and the Gold-standard labels.
  • ...and 1 more figure
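
The per-dataset characteristics listed for Figure 3 (cell count, gene number, clusters, and sparsity) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `summarize_dataset`, the toy matrix, and the toy labels are hypothetical, and sparsity is assumed here to mean the fraction of zero entries in the cells-by-genes expression matrix.

```python
import numpy as np


def summarize_dataset(counts, labels):
    """Return cell count, gene count, cluster count, and sparsity for one dataset.

    counts: 2-D array-like, rows are cells and columns are genes.
    labels: one cell-type label per cell.
    """
    counts = np.asarray(counts)
    if counts.ndim != 2:
        raise ValueError("counts must be a 2-D cells-by-genes matrix")
    n_cells, n_genes = counts.shape
    if len(labels) != n_cells:
        raise ValueError("expected one label per cell")
    # Assumption: sparsity is the fraction of entries equal to zero.
    sparsity = float(np.mean(counts == 0))
    return {
        "n_cells": n_cells,
        "n_genes": n_genes,
        "n_clusters": len(set(labels)),
        "sparsity": sparsity,
    }


# Toy 4-cell x 5-gene matrix, made up for illustration only.
toy_counts = np.array([
    [0, 3, 0, 0, 1],
    [0, 0, 0, 2, 0],
    [5, 0, 0, 0, 0],
    [0, 1, 0, 0, 4],
])
toy_labels = ["T cell", "T cell", "B cell", "B cell"]

summary = summarize_dataset(toy_counts, toy_labels)
print(summary)
```

For a real .h5ad file, the matrix and labels would typically come from `anndata.read_h5ad(path)` via `adata.X` and a column of `adata.obs`. The name of the label column is not specified here and would need to be checked per dataset, and a sparse `adata.X` would need converting with `.toarray()` before being passed to this function.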