Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen

Alessandro Palma, Till Richter, Hanyi Zhang, Manuel Lubetzki, Alexander Tong, Andrea Dittadi, Fabian Theis

TL;DR

Single-cell data are discrete and multi-modal, yet existing generative models often rely on normalized continuous representations. CFGen uses a conditional latent flow, trained with Flow Matching along Gaussian marginal paths, to model discrete multi-modal counts. Gene expression counts follow $\mathbf{x} \mid \mathbf{z}, \ell \sim \mathrm{NB}(\ell \, \mathrm{softmax}(h_\psi(\mathbf{z})), \bm{\theta})$ and binary DNA accessibility follows $\mathbf{b} \mid \mathbf{z} \sim \mathrm{Bernoulli}(\cdot)$, while the latent distribution $p(\mathbf{z} \mid \mathbf{y}, \ell)$ is learned via a conditional continuous normalizing flow (CNF). The method introduces compositional guidance for multi-attribute generation, which allows conditioning on multiple biological and technical covariates without training separate models. CFGen demonstrates superior generation fidelity across uni- and multi-modal data, facilitates robust data augmentation for rare cell types, and enables effective batch correction, making it a versatile tool for realistic single-cell simulation. The authors provide open-source software to promote adoption in computational biology workflows.
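
The negative binomial observation model above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random `logits` stand in for the decoder output $h_\psi(\mathbf{z})$, reading $\ell$ as a per-cell size factor (total counts) and $\bm{\theta}$ as a gene-wise inverse dispersion is an assumption, and all function names are hypothetical.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def sample_counts(logits, size_factor, theta, rng):
    """Sample x ~ NB(mean = l * softmax(logits), inverse dispersion = theta).

    NumPy parameterizes the negative binomial by (n, p) with mean n(1-p)/p,
    so n = theta and p = theta / (theta + mu) give mean mu.
    """
    mu = size_factor[:, None] * softmax(logits)  # each row sums to its size factor
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p), mu

rng = np.random.default_rng(0)
n_cells, n_genes = 4, 6
logits = rng.normal(size=(n_cells, n_genes))          # hypothetical stand-in for h_psi(z)
size_factor = np.array([100.0, 500.0, 1000.0, 2000.0])  # assumed per-cell total counts
theta = np.full(n_genes, 2.0)                          # assumed gene-wise inverse dispersion

counts, mu = sample_counts(logits, size_factor, theta, rng)
```

Because the softmax output sums to one per cell, the expected total count of each simulated cell equals its size factor, which is one reason this parameterization pairs naturally with conditioning on $\ell$.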

Abstract

Generative modeling of single-cell RNA-seq data is crucial for tasks like trajectory inference, batch effect removal, and simulation of realistic cellular data. However, recent deep generative models simulating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, overlooking the discrete nature of single-cell data, which limits their effectiveness and hinders the incorporation of robust noise models. Additionally, aspects like controllable multi-modal and multi-label generation of cellular data remain underexplored. This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics while tackling relevant generative tasks such as rare cell type augmentation and batch correction. We also introduce a novel framework for compositional data generation using Flow Matching. By showcasing CFGen on a diverse set of biological datasets and settings, we provide evidence of its value to the fields of computational biology and deep generative models.

Paper Structure (72 sections, 2 theorems, 29 equations, 22 figures, 11 tables, 2 algorithms)

Key Result

Proposition 1

If the attributes $y_1, \dots, y_K$ are conditionally independent given $\mathbf{z}$, the compositionally guided vector field coincides with the velocity of the probability-flow ODE associated with the generative SDE of a diffusion model that uses the compositional guidance score.
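
To make the idea of a compositionally guided vector field concrete, here is a toy NumPy sketch. The combination rule used below (unconditional field plus weighted differences to each single-attribute field) is an assumption in the style of classifier-free guidance; it is not quoted from this page, although it is consistent with the Figure 3 caption, where guidance weights of 0 give unconditional generation. The closed-form Gaussian fields are hypothetical stand-ins for trained networks.

```python
import numpy as np

def gaussian_field(mean, std):
    """Closed-form marginal velocity for the path z_t = (1 - t) z_0 + t z_1,
    with z_0 ~ N(0, I) and z_1 ~ N(mean, std^2 I). Toy stand-in for a network."""
    def u(z, t):
        var_t = (1 - t) ** 2 + (t * std) ** 2   # Var(z_t)
        cov = t * std ** 2 - (1 - t)            # Cov(z_1 - z_0, z_t)
        return mean + cov / var_t * (z - t * mean)
    return u

def compose(u_uncond, u_conds, weights):
    """Assumed rule: u(z, t) = u_uncond + sum_i w_i * (u_cond_i - u_uncond)."""
    def u(z, t):
        base = u_uncond(z, t)
        return base + sum(w * (uc(z, t) - base) for uc, w in zip(u_conds, weights))
    return u

def integrate(u, z0, n_steps=200):
    """Explicit Euler integration of dz/dt = u(z, t) from t = 0 to t = 1."""
    z, dt = z0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * u(z, k * dt)
    return z

rng = np.random.default_rng(0)
z0 = rng.normal(size=(500, 2))                      # samples from the Gaussian prior
u_uncond = gaussian_field(np.zeros(2), 2.0)         # hypothetical unconditional field
u_a = gaussian_field(np.array([3.0, 0.0]), 0.5)     # hypothetical field for attribute a
u_b = gaussian_field(np.array([0.0, 3.0]), 0.5)     # hypothetical field for attribute b

z_uncond = integrate(compose(u_uncond, [u_a, u_b], [0.0, 0.0]), z0)
z_guided = integrate(compose(u_uncond, [u_a, u_b], [1.0, 1.0]), z0)
```

Under this assumed rule, setting every weight to 0 recovers the unconditional field, and setting one weight to 1 with the rest at 0 recovers the corresponding single-attribute field, so a single trained model can be steered toward attribute combinations at sampling time.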

Figures (22)

  • Figure 1: The CFGen generative model. A noise vector $\mathbf{z}_0$ sampled from a Gaussian prior $p_0$ is transformed into a latent cell representation $\mathbf{z}_1$ by a compositional flow, conditioned on multiple biological and technical attributes. Decoders for gene expression and DNA accessibility map $\mathbf{z}_1$ to the parameters of negative binomial and Bernoulli noise models, from which single-cell gene expression and DNA accessibility peaks are sampled.
  • Figure 2: (a) Comparison between the gene-wise empirical mean-variance trend in real data and samples from generative models. (b) Frequency of the number of zeroes per cell in real and generated data.
  • Figure 3: Qualitative evaluation of guidance performance on attribute pairs in the NeurIPS 2021 and Tabula Muris datasets. Left: unconditional generation, with all guidance weights set to 0. Moving right: 500 simulated cells, with the guidance strength of one attribute progressively increased while that of the other is held fixed.
  • Figure 4: Difference in cell-type classification recall before and after augmentation, as a function of cell type frequency. The classifier is a k-nearest-neighbor (kNN) model with k = 10, trained on scGPT's representation space.
  • Figure 5: To perform batch correction, the scRNA-seq latent distribution is mapped to the prior distribution by inverting the flow model. The resulting points are then transported back to the data domain based on a common reference batch label and the original cell type label to preserve the biological structure. Cells are colored by batch.
  • ...and 17 more figures
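
The batch-correction procedure described in the Figure 5 caption can be sketched on toy data as follows. This is only an illustration under stated assumptions: the closed-form Gaussian vector fields stand in for the trained conditional flow, the two-dimensional latents, batch offsets, and cell type centers are invented, and explicit Euler integration is a simplification of whatever ODE solver is used in practice.

```python
import numpy as np

def gaussian_field(mean, std):
    """Closed-form marginal velocity for z_t = (1 - t) z_0 + t z_1,
    z_0 ~ N(0, I), z_1 ~ N(mean, std^2 I). Toy stand-in for a trained flow."""
    def u(z, t):
        var_t = (1 - t) ** 2 + (t * std) ** 2
        cov = t * std ** 2 - (1 - t)
        return mean + cov / var_t * (z - t * mean)
    return u

def integrate(u, z, t_start, t_end, n_steps=200):
    """Explicit Euler from t_start to t_end; t_end < t_start inverts the flow."""
    dt = (t_end - t_start) / n_steps
    z = z.copy()
    for k in range(n_steps):
        z = z + dt * u(z, t_start + k * dt)
    return z

# Hypothetical latent geometry: a cell type center plus a batch-specific offset.
std = 0.5
centers = {"T cell": np.array([0.0, 2.0]), "B cell": np.array([0.0, -2.0])}
offsets = {"batch_A": np.array([3.0, 0.0]), "reference": np.array([0.0, 0.0])}

def field(batch, cell_type):
    """Vector field conditioned on (batch, cell type) in this toy setup."""
    return gaussian_field(offsets[batch] + centers[cell_type], std)

rng = np.random.default_rng(0)
corrected = {}
for cell_type, center in centers.items():
    # Observed latents of one cell type, measured in batch_A.
    latents = offsets["batch_A"] + center + std * rng.normal(size=(500, 2))
    # Step 1: invert the flow under the original (batch, cell type) labels.
    noise = integrate(field("batch_A", cell_type), latents, 1.0, 0.0)
    # Step 2: transport back with the reference batch and the same cell type.
    corrected[cell_type] = integrate(field("reference", cell_type), noise, 0.0, 1.0)
```

In this toy setting the corrected points keep their position relative to the cell type center while the batch offset is removed, which mirrors the stated goal of preserving biological structure while aligning batches.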

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 1