Deep Learning in Single-Cell and Spatial Transcriptomics Data Analysis: Advances and Challenges from a Data Science Perspective

Shuang Ge, Shuqing Sun, Huan Xu, Qiang Cheng, Zhixiang Ren

TL;DR

This paper surveys deep learning approaches for single-cell and spatial transcriptomics from a data science perspective, focusing on four core challenges: data sparsity, diversity, scarcity, and correlation. It analyzes DL methods across data representations, multimodal/multi-source integration, data generation, and prior-knowledge incorporation, and benchmarks 58 methods on 21 datasets from 9 benchmarks. The authors also curate datasets and propose evaluation strategies, highlighting gaps in benchmark design and the need for biologically relevant metrics. They foresee advances from novel AI paradigms (foundation models, self-supervised learning) and improved benchmarks, with practical impact on biology and precision medicine.

Abstract

The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. However, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, often contaminated by noise and uncertainty, obscuring the underlying biological signals. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering-based analysis methods struggle to deal with the various challenges presented by intricate biological networks. Deep learning has emerged as a powerful tool capable of handling high-dimensional complex data and automatically identifying meaningful patterns, offering significant promise in addressing these challenges. This review systematically analyzes these challenges and discusses related deep learning approaches. Moreover, we have curated 21 datasets from 9 benchmarks, encompassing 58 computational methods, and evaluated their performance on the respective modeling tasks. Finally, we highlight three areas for future development from a technical, dataset, and application perspective. This work will serve as a valuable resource for understanding how deep learning can be effectively utilized in single-cell and spatial transcriptomics analyses, while inspiring novel approaches to address emerging challenges.

Paper Structure

This paper contains 28 sections, 3 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: The overall structure of the article is organized into three main sections. (a) An overview of key sequencing technologies in single-cell and spatial transcriptomics; (b) A discussion of four significant scientific and technical challenges within the field from a data science perspective, namely: data sparsity, data diversity, data scarcity, and data correlation; (c) An exploration of potential future perspectives that includes innovative AI methodologies, benchmark datasets and evaluation metrics, as well as applications of DL in practical scenarios. Some components of this figure are drawn by Figdraw.
  • Figure 2: The sequencing pipeline for single-cell and spatial transcriptomics data. (a) Bulk-based techniques provide average gene expression profiles at the tissue level, with cell proportions estimated through deconvolution methods. (b) Microfluidic-based techniques isolate individual cells into droplets or wells, followed by barcoding and sequencing. (c) Spatial barcode-based techniques utilize cell barcodes to capture poly-adenylated RNA molecules in situ before reverse transcription. (d) Targeted in situ sequencing employs specifically designed probes to bind RNA or cDNA targets, leveraging in situ spatial information.
  • Figure 3: Revisualization of the benchmark results for data imputation from five benchmark datasets [dai2022scimc, bai2024sae]. In benchmark 1 (datasets 1 and 2), 'clustering' represents the average of the clustering evaluation metrics NMI and ARI, while 'consistency' includes PCC. In benchmark 2 (datasets 3-5), 'clustering' represents the mean of NMI and ARI, and 'consistency' refers to the mean of F1, AUC, and ACC. The green rectangle indicates the largest point size (imputation consistency), while the orange rectangle represents the highest color value (clustering performance).
  • Figure 4: The structure of section "Data Sparsity" and related methods. The tree chart outlines the challenges associated with processing sparse single-cell data, focusing on issues including the curse of dimensionality, noise, and uncertainty.
  • Figure 5: The challenges and typical approaches for data sparsity. Neural networks often model data representations in complex latent spaces, particularly in scenarios with more factors of variability, such as uncertainties in experimental processes. (a) For the curse of dimensionality, we plot the framework of scvis [ding2018interpretable]. (b) For batch correction, the neural network approach (CLEAR [han2022self]) shares the same objective as nearest-neighbor matching (NNM [haghverdi2018batch]). Both approaches use distance to measure the similarity between samples, facilitating the clustering of samples of the same type across different batches. (c) For imputation, the VAE-based method (DCA [eraslan2019single]) is used for noise separation through data reconstruction. (d) For modeling uncertainty, scVI incorporates stochastic factors inherent in the sequencing process, providing a framework to better capture and account for variability in the data.
  • ...and 9 more figures
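
The Figure 3 caption summarizes imputation quality with two aggregate scores: 'clustering' as the mean of NMI and ARI, and 'consistency' based on PCC. The following is a minimal pure-Python sketch of how such scores could be computed, not the benchmark authors' implementation. The function names are hypothetical, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the caption does not specify it.

```python
import math
from collections import Counter


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts in the contingency table."""
    comb2 = lambda k: k * (k - 1) / 2  # number of unordered pairs among k items
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb2(c) for c in cells.values())
    sum_rows = sum(comb2(c) for c in Counter(labels_true).values())
    sum_cols = sum(comb2(c) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb2(n)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """NMI, normalized by the arithmetic mean of the entropies (assumed)."""
    n = len(labels_true)
    rows, cols = Counter(labels_true), Counter(labels_pred)
    cells = Counter(zip(labels_true, labels_pred))
    mi = sum((c / n) * math.log(c * n / (rows[a] * cols[b]))
             for (a, b), c in cells.items())
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    return 1.0 if denom == 0 else mi / denom


def pearson_correlation(x, y):
    """Pearson correlation coefficient (PCC) between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def clustering_score(labels_true, labels_pred):
    """'Clustering' score as described in the caption: mean of NMI and ARI."""
    return (normalized_mutual_info(labels_true, labels_pred)
            + adjusted_rand_index(labels_true, labels_pred)) / 2
```

In use, `clustering_score` would be called with reference cell-type labels and the cluster labels obtained after imputation, and `pearson_correlation` with flattened reference and imputed expression values. Both label-based metrics are invariant to how clusters are named, so a relabeled but otherwise identical clustering still scores 1.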