PerturbDiff: Functional Diffusion for Single-Cell Perturbation Modeling

Xinyu Yuan; Xixian Liu; Ya Shi Zhang; Zuobai Zhang; Hongyu Guo; Jian Tang

Abstract

Building Virtual Cells that can accurately simulate cellular responses to perturbations is a long-standing goal in systems biology. A fundamental challenge is that high-throughput single-cell sequencing is destructive: the same cell cannot be observed both before and after a perturbation. Thus, perturbation prediction requires mapping unpaired control and perturbed populations. Existing models address this by learning maps between distributions, but typically assume a single fixed response distribution when conditioned on observed cellular context (e.g., cell type) and the perturbation type. In reality, responses vary systematically due to unobservable latent factors such as microenvironmental fluctuations and complex batch effects, forming a manifold of possible distributions for the same observed conditions. To account for this variability, we introduce PerturbDiff, which shifts modeling from individual cells to entire distributions. By embedding distributions as points in a Hilbert space, we define a diffusion-based generative process operating directly over probability distributions. This allows PerturbDiff to capture population-level response shifts across hidden factors. Benchmarks on established datasets show that PerturbDiff achieves state-of-the-art performance in single-cell response prediction and generalizes substantially better to unseen perturbations. See our project page (https://katarinayuan.github.io/PerturbDiff-ProjectPage/), where code and data will be made publicly available (https://github.com/DeepGraphLearning/PerturbDiff).

Paper Structure (57 sections, 10 theorems, 73 equations, 20 figures, 4 tables)

Introduction
Preliminary
Notations: Cells, Populations, and Distributions
Problem Formulation
Diffusion Models
Related Work
Method
Motivation
Representing Cell Distributions
Diffusion Modeling on Cell Distributions
Training and Sampling
Architecture Design
Marginal Pretraining as a Prior
Discussion
Experiment
...and 42 more sections

Key Result

Lemma 2.1

For any distributions $P$ and $Q$ on $\mathcal{X}$ and any $\alpha \in [0,1]$, the following hold: (1) Linearity under mixing: $\boldsymbol{\mu}_{\alpha P + (1-\alpha) Q} = \alpha \boldsymbol{\mu}_P + (1-\alpha) \boldsymbol{\mu}_Q$, ensuring that convex combinations of distributions correspon
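
The linearity property (1) can be checked numerically on empirical distributions. The sketch below is illustrative only and is not taken from the paper's code: it assumes a Gaussian RBF kernel, uses synthetic samples in place of cell populations, and the helper names (`rbf_kernel`, `mean_embedding`, `uniform_weights`) are hypothetical. It relies on the reproducing property, under which the embedding evaluated at a point $z$ is $\boldsymbol{\mu}_P(z) = \mathbb{E}_{x \sim P}[k(x, z)]$.

```python
import math
import random

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_embedding(samples, weights, query):
    """Weighted empirical kernel mean embedding evaluated at `query`:
    mu(query) = sum_i w_i * k(x_i, query)."""
    return sum(w * rbf_kernel(x, query) for x, w in zip(samples, weights))

def uniform_weights(n):
    """Weights of an unweighted empirical distribution over n samples."""
    return [1.0 / n] * n

rng = random.Random(0)
# Two synthetic "populations" in a 3-dimensional space (toy data, not real cells).
P = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]
Q = [[rng.gauss(2.0, 0.5) for _ in range(3)] for _ in range(60)]
alpha = 0.3

# The mixture alpha*P + (1-alpha)*Q written as one weighted empirical measure.
mix_samples = P + Q
mix_weights = [alpha / len(P)] * len(P) + [(1 - alpha) / len(Q)] * len(Q)

# Compare both sides of the identity at a few arbitrary query points.
queries = [[rng.uniform(-1.0, 3.0) for _ in range(3)] for _ in range(5)]
max_gap = 0.0
for z in queries:
    lhs = mean_embedding(mix_samples, mix_weights, z)
    rhs = (alpha * mean_embedding(P, uniform_weights(len(P)), z)
           + (1 - alpha) * mean_embedding(Q, uniform_weights(len(Q)), z))
    max_gap = max(max_gap, abs(lhs - rhs))

print(f"max |lhs - rhs| over query points: {max_gap:.2e}")
```

The two sides agree up to floating-point rounding because the embedding is an expectation and expectations are linear in the underlying measure; the same argument holds for any bounded kernel, not only the RBF kernel assumed here.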

Figures (20)

Figure 1: Distributional variability in single-cell perturbation data. (a) Traditional methods operate on unpaired control and perturbed cells, learning to map a control cell distribution to a perturbed one. (b) However, variations in cell distributions arise from unobserved latent factors, inducing a family of distinct cell distributions and shifting the objective to learning a distribution over cell distributions.
Figure 2: Overview of the PerturbDiff framework. (a) Distribution-valued random variables $D_{c,\tau}$ and $D_{c}$ in cell space are mapped to Hilbert-space elements $\boldsymbol{\mu}_{c,\tau}$ and $\boldsymbol{\mu}_{c} \in \mathcal{H}_k$ via kernel mean embedding. (b) Diffusion is defined on perturbed embeddings $\boldsymbol{\mu}_{0} := \boldsymbol{\mu}_{c,\tau}$, with a denoising network predicting the target $\boldsymbol{\mu}_\theta$. (c) Each MM-DiT block performs joint attention over control and perturbed token streams.
Figure 3: Perturbation prediction results across methods, metrics, and datasets. Each axis represents a performance metric, with higher values indicating better performance.
Figure 4: Per-metric scatter plots comparing PerturbDiff (From Scratch) and STATE. Each point denotes one held-out condition (62 conditions for PBMC; 735 for Tahoe100M).
Figure 5: Distribution of $-\log_{10}(p_{\text{adj}})$ for true DE and non-DE genes on PBMC across true data, PerturbDiff (From Scratch), and STATE. Larger values indicate a gene is more likely DE.
...and 15 more figures

Theorems & Definitions (20)

Remark 4.1: Geometric Properties
Definition 4.2: Forward diffusion process
Lemma 2.1: Basic properties of kernel mean embeddings
Proposition 2.2
Definition 2.3: Symmetric operator
Definition 2.4: Positive semi-definite operator
Definition 2.5: Compact operator
Definition 2.6: Trace-class operator
Lemma 2.7: Affine transformations
Lemma 2.8: Sum of independent Gaussian random elements
...and 10 more
