A Unification of Discrete, Gaussian, and Simplicial Diffusion

Nuria Alina Chandra, Yucen Lily Li, Alan N. Amin, Alex Ali, Joshua Rollins, Sebastian W. Ober, Aniruddh Raghu, Andrew Gordon Wilson

TL;DR

This work unifies discrete, Gaussian, and simplicial diffusion under the Wright-Fisher diffusion framework, showing the three models arise as distinct limits and deriving the connections among their likelihoods and hyperparameters. It addresses the long-standing stability issues of simplicial diffusion by leveraging exact Wright-Fisher sampling and related genetics theory, yielding a fast, stable diffusion method for DNA generation. A key practical contribution is the sufficient-statistic parameterization (SSP), which enables training a single neural network that can perform diffusion across all three domains at test time, achieving competitive results across proteins, language, and DNA. Together, these advances enable robust, domain-agnostic diffusion modeling with cross-domain transfer and flexible deployment in downstream tasks.

Abstract

To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from numerically unstable stochastic processes. Ideally, we could see each of these models as instances of the same underlying framework, and enable practitioners to switch between models for downstream applications. However, previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating that it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.

Paper Structure

This paper contains 88 sections, 15 theorems, 134 equations, 13 figures, 6 algorithms.

Key Result

Theorem 4.1

(Formal statement and proof in the appendix.) Call $0>-\lambda_1>-\lambda_2>\dots$ the eigenvalues of $\mathcal{L}$ and $P_1$ the projection onto the left eigenspace corresponding to $\lambda_1$. Without loss of generality, assume $\lambda_1=1$ (this assumption is for convenience). Rescale ...

Figures (13)

  • Figure 1: Discrete, Gaussian, and Simplicial diffusion for discrete data are unified by Wright-Fisher diffusion. (a) Wright-Fisher diffusion with population size $\zeta=6$, showing mutation and reproduction processes across generations. (b) The three diffusion methods emerge as different limits of Wright-Fisher: discrete diffusion corresponds to $\zeta=1$, while Gaussian and simplicial diffusion arise as $\zeta \to \infty$ with zero and non-zero reproduction rates, respectively.
  • Figure 2: Discrete diffusion with a large population converges to Gaussian diffusion. With $\zeta=1000$, we show example trajectories $(\vec{x}_t)_t$ that converge to approximate Gaussians near $\vec{\pi}$.
  • Figure 3: The hollow parameterization leads to realistic reverse path samples. $\zeta=300$.
  • Figure 4: $\mathrm{emb}$ of amino acids from BLOSUM $\mathcal{L}$. $\mathrm{emb}(x_0)$ from Thm. 4.1 for $\mathcal{L}$ from Amin2025-ag.
  • Figure 5: Improved simplicial diffusion performs accurate conditional DNA generation. We generate DNA samples of length 500 conditioned on accessibility with a classifier. (a) For an example target, we plot predicted accessibility profiles at the centre 150 positions of 5 example samples from each model. We smooth profiles with a bandwidth of 2. (b) For 1000 targets and 10 samples from each model, we plot the error between the predicted and target profiles and its standard error.
  • ...and 8 more figures
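
The large-population behaviour described in Figures 1 and 2 can be illustrated with a small simulation. This is a minimal sketch, not the paper's algorithm: it assumes a uniform-mutation process at rate 1 with no reproduction, so that each of the $\zeta$ individuals independently keeps its starting state with probability $e^{-t}$ and otherwise resamples uniformly. The function names `marginal_probs` and `sample_frequencies` are hypothetical and chosen only for this example.

```python
import numpy as np

def marginal_probs(num_states, t, start_state=0):
    """Time-t distribution of one individual under uniform mutation at rate 1 (assumed)."""
    p_stay = np.exp(-t)  # probability that no mutation event has occurred by time t
    probs = np.full(num_states, (1.0 - p_stay) / num_states)
    probs[start_state] += p_stay
    return probs

def sample_frequencies(zeta, num_states, t, rng, start_state=0, num_samples=1):
    """Empirical state frequencies of zeta independent individuals at time t.

    With zero reproduction the individuals evolve independently, so the
    state counts are multinomial with the single-individual marginal.
    """
    probs = marginal_probs(num_states, t, start_state)
    counts = rng.multinomial(zeta, probs, size=num_samples)
    return counts / zeta

rng = np.random.default_rng(0)

# zeta = 1: the frequency vector is a one-hot vector (discrete diffusion).
one_hot = sample_frequencies(1, 4, 0.5, rng)[0]

# zeta = 1000: frequencies concentrate around the marginal, with fluctuations
# of order 1 / sqrt(zeta) (the Gaussian regime shown in Figure 2).
near_gaussian = sample_frequencies(1000, 4, 0.5, rng, num_samples=5)

print(one_hot)
print(near_gaussian)
```

In this sketch, multiplying the deviation from the marginal by $\sqrt{\zeta}$ gives fluctuations with variance close to $p(1-p)$ per coordinate, which is the central-limit scaling behind the Gaussian limit. Non-zero reproduction, which the Figure 1 caption associates with the simplicial limit, is deliberately left out here.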

Theorems & Definitions (25)

  • Theorem 4.1
  • Theorem 5.1
  • Proposition 6.1
  • Proposition C.1
  • Proposition C.2
  • proof
  • Theorem E.1
  • proof
  • Proposition E.2
  • proof
  • ...and 15 more