Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models

Giovanni Palla, Sudarshan Babu, Payam Dibaeinia, James D. Pearce, Donghui Li, Aly A. Khan, Theofanis Karaletsos, Jakub M. Tomczak

TL;DR

This paper tackles the challenge of generating realistic single-cell gene expression profiles by enforcing exchangeability of genes and introducing scLDM, a Transformer-based VAE with fixed-size, permutation-invariant latent tokens. It replaces the Gaussian prior with a latent diffusion model parameterized by Diffusion Transformers, enabling multi-conditional, controllable generation via classifier-free guidance. The two-stage approach yields a powerful encoder–decoder architecture (MCAB) and a diffusion-based latent space, achieving state-of-the-art results in reconstruction, unconditional and conditional generation on observational and perturbational data, and producing embeddings that bolster downstream classification tasks. The work demonstrates the practical impact of enforcing exchangeability for scalable, high-fidelity generative modeling in single-cell genomics and sets the stage for applying similar foundations to other exchangeable biological data and multi-omics integration.
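
As a rough illustration of multi-conditional classifier-free guidance, the sketch below combines an unconditional prediction with one prediction per conditioning attribute (for example cell type and perturbation). The additive combination rule, the function name `guided_velocity`, and the guidance scales are assumptions made for illustration; the paper's exact guidance formula may differ.

```python
import numpy as np

def guided_velocity(v_uncond, v_conds, scales):
    """Combine model outputs with classifier-free guidance.

    Assumed rule (a common choice, not confirmed by the paper):
        v = v_uncond + sum_k w_k * (v_cond_k - v_uncond)
    """
    guided = v_uncond.astype(float).copy()
    for v_cond, w in zip(v_conds, scales):
        # Each attribute pushes the prediction away from the unconditional one.
        guided += w * (v_cond - v_uncond)
    return guided

rng = np.random.default_rng(0)
shape = (4, 16)  # hypothetical: 4 latent tokens of width 16
v_uncond = rng.normal(size=shape)
v_cell_type = rng.normal(size=shape)      # prediction conditioned on cell type
v_perturbation = rng.normal(size=shape)   # prediction conditioned on perturbation

v_guided = guided_velocity(v_uncond, [v_cell_type, v_perturbation], [2.0, 1.5])
```

With all scales set to zero the result reduces to the unconditional prediction, and with a single attribute at scale 1 it reduces to that attribute's conditional prediction.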

Abstract

Computational modeling of single-cell gene expression is crucial for understanding cellular processes, but generating realistic expression profiles remains a major challenge. This difficulty arises from the count nature of gene expression data and complex latent dependencies among genes. Existing generative models often impose artificial gene orderings or rely on shallow neural network architectures. We introduce a scalable latent diffusion model for single-cell gene expression data, which we refer to as scLDM, that respects the fundamental exchangeability property of the data. Our VAE uses fixed-size latent variables leveraging a unified Multi-head Cross-Attention Block (MCAB) architecture, which serves dual roles: permutation-invariant pooling in the encoder and permutation-equivariant unpooling in the decoder. We enhance this framework by replacing the Gaussian prior with a latent diffusion model using Diffusion Transformers and linear interpolants, enabling high-quality generation with multi-conditional classifier-free guidance. We show its superior performance in a variety of experiments for both observational and perturbational single-cell data, as well as downstream tasks like cell-level classification.
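
To make the permutation-invariant pooling idea concrete, here is a minimal single-head sketch in NumPy, assuming a fixed set of learned latent queries that attend over a set of gene tokens. It is not the paper's MCAB implementation: the names `cross_attention_pool`, `latent_queries`, `w_k`, and `w_v`, the single head, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(gene_tokens, latent_queries, w_k, w_v):
    """Pool a set of gene tokens into a fixed number of latent tokens.

    gene_tokens:    (n_genes, d)   one token per gene, in any order
    latent_queries: (n_latents, d) fixed-size set of query vectors
    """
    keys = gene_tokens @ w_k                                   # (n_genes, d)
    values = gene_tokens @ w_v                                 # (n_genes, d)
    scores = latent_queries @ keys.T / np.sqrt(keys.shape[-1])  # (n_latents, n_genes)
    weights = softmax(scores, axis=-1)                         # attention over genes
    return weights @ values                                    # (n_latents, d)

rng = np.random.default_rng(0)
n_genes, d, n_latents = 50, 16, 4
gene_tokens = rng.normal(size=(n_genes, d))
latent_queries = rng.normal(size=(n_latents, d))
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))

pooled = cross_attention_pool(gene_tokens, latent_queries, w_k, w_v)

# Shuffle the gene order; the pooled latents should be unchanged up to
# floating-point rounding.
perm = rng.permutation(n_genes)
pooled_shuffled = cross_attention_pool(gene_tokens[perm], latent_queries, w_k, w_v)
print(np.allclose(pooled, pooled_shuffled))
```

Because the queries are fixed and the softmax-weighted sum runs over the whole gene set, reordering the genes only reorders the terms of the sum. The output size depends on the number of queries, not on the number of genes.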

Paper Structure

This paper contains 72 sections, 39 equations, 14 figures, 18 tables.

Figures (14)

  • Figure 1: Our deep generative model, scLDM, for single-cell gene expression data. A: A fully transformer-based architecture for processing gene expressions. The encoder network produces permutation-invariant latent variables represented as tokens. The decoder network returns permutation-equivariant counts for given gene IDs. B: At the second stage, a vanilla prior is replaced by a latent diffusion model. We model latent tokens using Diffusion Transformers (DiT), and train the resulting LDM using linear interpolants and the flow matching loss. Sampling is carried out by applying the Scalable Interpolant Transformers (SiT) library (Ma et al., 2024).
  • Figure 2: Conditional generation for the HLCA dataset for: (a) scLDM, (b) CFGen and (c) scDiffusion. Expression levels for 3 marker genes: (d) ACTA2, (e) COL1A1 and (f) CFD, markers of "alveolar type 2 fibroblast cell", corresponding to cell populations in the insets.
  • Figure 3: Conditional generation across multiple attributes: cell type and perturbation. (a) Generated vs. true cells across all cell types in the Parse 1M dataset show close alignment. (b–c) For CD4 Naive cells, conditioning on cytokine perturbations (IL-9, LT-alpha1-beta2) produces perturbation-specific shifts consistent with the true test distributions. (d) Generated vs. true cells across all cell types in the Replogle dataset. (e–f) For HepG2 cells, conditioning on genetic perturbations (PPP6C, ZDHHC7) yields realistic perturbation-dependent distributions that closely follow the experimental data.
  • Figure 4: Ablations on VAE width, depth and number of latent tokens
  • Figure 5: Visualization of the gene-wise variance for true and generated data for CFGen (left), scDiffusion (middle) and our model (right), for the conditional generation settings on Dentate Gyrus. The error bars represent the standard errors over 3 seeds.
  • ...and 9 more figures
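
The Figure 1 caption mentions training the latent diffusion model with linear interpolants and the flow matching loss. The sketch below shows one way such an objective can be written, assuming the convention x_t = (1 - t) z + t eps with velocity target eps - z. The time convention, the function names, and the stand-in `zero_model` are assumptions for illustration; in the paper the velocity model is a Diffusion Transformer and the exact parameterization may differ.

```python
import numpy as np

def linear_interpolant(z, eps, t):
    """x_t = (1 - t) * z + t * eps, with one time value per sample."""
    t = t[:, None, None]  # broadcast over (tokens, dim)
    return (1.0 - t) * z + t * eps

def target_velocity(z, eps):
    """Time derivative of the linear interpolant: d x_t / d t = eps - z."""
    return eps - z

def flow_matching_loss(velocity_fn, z, rng):
    """Mean squared error between predicted and target velocity."""
    eps = rng.normal(size=z.shape)       # Gaussian noise endpoint
    t = rng.uniform(size=z.shape[0])     # random time in [0, 1) per sample
    x_t = linear_interpolant(z, eps, t)
    pred = velocity_fn(x_t, t)
    return np.mean((pred - target_velocity(z, eps)) ** 2)

rng = np.random.default_rng(0)
# Hypothetical latent tokens from a VAE encoder: (batch, n_latents, dim).
z = rng.normal(size=(8, 4, 16))

# Stand-in for a trained network: always predicts zero velocity.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = flow_matching_loss(zero_model, z, rng)
```

Under this convention the interpolant equals the data latents at t = 0 and pure noise at t = 1, so sampling integrates the learned velocity field from t = 1 back to t = 0.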

Theorems & Definitions (3)

  • proof
  • proof
  • proof