scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis
Ping Xu, Zaitian Wang, Zhirui Wang, Pengjiang Li, Ran Zhang, Gaoyang Li, Hanyu Xie, Jiajia Wang, Yuanchun Zhou, Pengfei Wang
TL;DR
The paper tackles reproducibility and benchmarking challenges in single-cell RNA sequencing by introducing scUnified, an AI-ready, standardized resource that compiles 13 high-quality datasets across two species (human and mouse) and nine tissues. All data undergo uniform quality control and preprocessing and are stored in the .h5ad (AnnData) format, so they can be used directly for clustering, cell-type annotation, and marker identification without additional cleaning. The authors demonstrate utility through cross-method evaluations (Leiden, scMAE, scCDCG) and a knowledge-informed foundation model (GeneCompass), plus case studies on representation learning and marker-based annotation that illustrate the resource's validity across datasets. The resource lowers technical barriers, supports fair benchmarking, and accelerates AI-driven single-cell research; the authors plan to add further species, tissues, and multi-omics data.
Abstract
Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.
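To make "standardized quality control and preprocessing" concrete, the following is a minimal sketch of steps commonly applied to scRNA-seq count matrices: filtering low-quality cells and rarely detected genes, library-size normalization, and a log transform. It is not the scUnified pipeline, since the abstract does not specify the exact steps or thresholds. The function name `basic_qc_and_normalize`, the toy matrix, and all threshold values are hypothetical and chosen only for illustration.

```python
import numpy as np


def basic_qc_and_normalize(counts, min_genes=2, min_cells=2, target_sum=1e4):
    """Illustrative QC and normalization for a cells x genes count matrix.

    Thresholds are assumptions for this sketch, not values from scUnified.
    """
    counts = np.asarray(counts, dtype=float)

    # Keep cells that express at least `min_genes` genes.
    cell_mask = (counts > 0).sum(axis=1) >= min_genes
    counts = counts[cell_mask]

    # Keep genes detected in at least `min_cells` of the remaining cells.
    gene_mask = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, gene_mask]

    # Scale each cell to the same total count, guarding against empty cells.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    normalized = counts / totals * target_sum

    # Log-transform to compress the dynamic range.
    return np.log1p(normalized), cell_mask, gene_mask


# Toy matrix: 4 cells (rows) x 4 genes (columns) of raw counts.
raw = np.array([
    [5, 0, 3, 0],
    [0, 0, 0, 1],  # expresses a single gene, so it fails the cell filter
    [2, 1, 4, 0],
    [1, 0, 2, 0],
])

log_norm, kept_cells, kept_genes = basic_qc_and_normalize(raw)
print(log_norm.shape)  # (3, 2): three cells and two genes survive filtering
```

In practice, .h5ad files are read with the anndata library (for example via `anndata.read_h5ad`), and the resulting matrix can be passed to steps like these or to an equivalent toolkit. Because scUnified ships already-preprocessed data, a user would normally skip this stage and go straight to the downstream task.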
