
scDiffusion: conditional generation of high-quality single-cell data using diffusion model

Erpai Luo, Minsheng Hao, Lei Wei, Xuegong Zhang

Abstract

Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at the single-cell level. However, obtaining enough high-quality scRNA-seq data remains challenging. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated by current models are not yet very realistic, especially when the data must be generated under controlled conditions. Meanwhile, diffusion models have shown their power in generating data with high fidelity, providing a new opportunity for scRNA-seq generation. In this study, we developed scDiffusion, a generative model that combines a diffusion model and a foundation model to generate high-quality scRNA-seq data under controlled conditions. We designed multiple classifiers to guide the diffusion process simultaneously, enabling scDiffusion to generate data under combinations of multiple conditions. We also proposed a new control strategy called Gradient Interpolation, which allows the model to generate continuous trajectories of cell development from a given cell state. Experiments showed that scDiffusion can generate single-cell gene expression data closely resembling real scRNA-seq data. scDiffusion can also conditionally generate data for specific cell types, including rare ones. Furthermore, we could use the multiple-condition generation of scDiffusion to generate cell types that were absent from the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting real scRNA-seq data and can provide insights into cell fate research.
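
The abstract names two control mechanisms without giving formulas: several classifiers guiding the diffusion process at once, and Gradient Interpolation between cell states. The sketch below illustrates the general idea of classifier guidance under loudly stated assumptions. It is not the authors' implementation. The Gaussian "classifier" gradient, the additive combination of gradients, the linear interpolation rule, and all names (`gaussian_log_prob_grad`, `guided_mean`, `interpolated_gradient`, the two centers) are hypothetical stand-ins chosen so the example runs with NumPy alone.

```python
import numpy as np

def gaussian_log_prob_grad(x, center, sigma=1.0):
    """Gradient w.r.t. x of log N(x; center, sigma^2 I).

    Toy stand-in (assumption) for d/dx log p(y | x) from a trained classifier.
    """
    return (center - x) / sigma**2

def guided_mean(mean, variance, grads, scales):
    """Shift a reverse-step mean by the weighted sum of condition gradients."""
    shift = sum(s * g for s, g in zip(scales, grads))
    return mean + variance * shift

def interpolated_gradient(grad_a, grad_b, weight):
    """Linear blend of two condition gradients: weight=0 gives a, weight=1 gives b."""
    return (1.0 - weight) * grad_a + weight * grad_b

# Toy 2-D latent state; the centers are hypothetical condition prototypes.
x = np.zeros(2)
cell_type_center = np.array([2.0, 0.0])
organ_center = np.array([0.0, 2.0])

g_type = gaussian_log_prob_grad(x, cell_type_center)
g_organ = gaussian_log_prob_grad(x, organ_center)

# Two conditions applied simultaneously: their gradients are summed.
both = guided_mean(mean=x, variance=0.1, grads=[g_type, g_organ], scales=[1.0, 1.0])

# One blended condition, halfway between the two states.
halfway = guided_mean(
    mean=x,
    variance=0.1,
    grads=[interpolated_gradient(g_type, g_organ, 0.5)],
    scales=[1.0],
)
print(both, halfway)
```

In this toy setting, sweeping `weight` from 0 to 1 moves the guidance target smoothly from one condition to the other, which is the intuition behind generating a continuous trajectory between two cell states.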

Paper Structure (13 sections, 7 equations, 11 figures, 2 tables)


Figures (11)

  • Figure 1: The overall structure of scDiffusion.
  • Figure 2: scDiffusion can generate realistic cell data. (a) The training loss curves for fine-tuning the autoencoder from pre-trained SCimilarity weights and for training the autoencoder from scratch. (b) UMAP of scDiffusion-generated Tabula Muris data and real Tabula Muris data. (c) UMAP of scDiffusion-generated Human Lung PF data and real Human Lung PF data. (d) UMAP of scDiffusion-generated PBMC68k data and real PBMC68k data.
  • Figure 3: (a) UMAP of different cell types in the Tabula Muris dataset generated by conditional diffusion. The Thymus cell is a rare cell type. (b) The AUC score of KNN for different cell types in the Tabula Muris dataset. (c) UMAP of different cell types in the PBMC68k dataset generated by conditional diffusion. The CD34+ cell is a rare cell type. (d) The AUC score of KNN for different cell types in the PBMC68k dataset.
  • Figure 4: Marker genes' expression levels in real cells of this type, generated cells of this type, and real cells of other types. (a) Marker genes of mammary B cells. (b) Marker gene of thymus memory B cells. (c) Marker gene of spleen macrophage cells.
  • Figure 5: (a) The MMD score of different methods at different timestamps. (b) The LISI score of different methods at different timestamps. (c) UMAP of real cells. (d) UMAP of cells generated by Gradient Interpolation.
  • ...and 6 more figures
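
Figure 5 reports an MMD (maximum mean discrepancy) score between generated and real cells. As a rough illustration of what such a score measures, here is a minimal sketch of a biased squared-MMD estimate with an RBF kernel. The kernel choice, the bandwidth `gamma`, and the synthetic Gaussian samples are assumptions made for the example. The paper's exact settings are not given in this section.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd_squared(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )

# Synthetic stand-ins for real and generated cell embeddings (assumption).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 5))
similar = rng.normal(0.0, 1.0, size=(200, 5))   # same distribution as `real`
shifted = rng.normal(3.0, 1.0, size=(200, 5))   # clearly different distribution

mmd_similar = mmd_squared(real, similar, gamma=0.1)
mmd_shifted = mmd_squared(real, shifted, gamma=0.1)
print(mmd_similar, mmd_shifted)
```

A lower value indicates that the two samples are harder to tell apart under the chosen kernel, so generated data that resemble real data should score close to zero.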