scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling
Shengze Dong, Zhuorui Cui, Ding Liu, Jinzhi Lei
TL;DR
scRDiT introduces a diffusion-transformer–based approach to generate high-fidelity synthetic scRNA-seq data from real datasets, addressing the scarcity of samples. By combining a DDPM framework with Diffusion Transformers and a zero-negation preprocessing step, it trains cell-type–specific models and uses DDIM sampling to achieve 10–20× faster generation. The method demonstrates robust performance across multiple datasets, improving zero-proportion and coefficient-of-variation realism while producing samples that closely match real distributions per cell type, as shown by KL, Wasserstein, and MMD metrics. This approach provides a practical tool for augmenting sparse scRNA-seq datasets, enabling more reliable downstream analyses and method benchmarking.
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) is a widely used technology in biological research that measures gene expression at the level of individual cells within a tissue sample. Although numerous tools have been developed for scRNA-seq data analysis, it remains challenging to capture the distinctive features of such data and to generate virtual datasets with analogous statistical properties.
Results: We introduce a generative approach termed scRNA-seq Diffusion Transformer (scRDiT), which generates virtual scRNA-seq data from a real dataset. The method is a neural network built on Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs). During training, Gaussian noise is added to real scRNA-seq samples over iterative noise-adding steps, and the network learns to reverse this process; at sampling time, it progressively denoises Gaussian noise into scRNA-seq samples. This scheme allows the model to learn data features from actual scRNA-seq samples during training. In experiments on two distinct scRNA-seq datasets, scRDiT demonstrates superior performance. Additionally, the sampling process is accelerated by incorporating Denoising Diffusion Implicit Models (DDIM). scRDiT provides a unified methodology that lets users train neural network models on their own scRNA-seq datasets and generate large numbers of high-quality scRNA-seq samples.
Availability and implementation: https://github.com/DongShengze/scRDiT
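The noise-adding and accelerated-sampling steps described above can be illustrated with a minimal NumPy sketch of the standard DDPM forward process and a deterministic DDIM update. This is not the scRDiT implementation: the linear beta schedule, the step counts, and all function names here are assumptions chosen for illustration, and the noise prediction `eps_pred` stands in for the output of a trained network such as a DiT.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule; the paper's schedule may differ."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def ddim_step(xt, eps_pred, t, t_prev, alpha_bar):
    """Deterministic DDIM update (eta = 0) from step t to step t_prev.

    t_prev = -1 denotes the final, noise-free state (abar = 1).
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    abar_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Re-noise the estimate to the (lower) noise level of t_prev.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced subsequence of training steps, in descending order."""
    steps = np.linspace(0, num_train_steps - 1, num_sample_steps)
    return steps.round().astype(int)[::-1]
```

Because DDIM visits only a subsequence of the training steps, the number of network evaluations per generated sample drops proportionally. With a hypothetical 1000 training steps, for instance, sampling over 50 to 100 steps would require 10 to 20 times fewer evaluations.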
