Dirichlet Flow Matching with Applications to DNA Sequence Design

Hannes Stark, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, Tommi Jaakkola

TL;DR

Dirichlet flow matching addresses the limitations of naïve linear flow matching on the simplex for discrete data by using Dirichlet-based probability paths and a derived vector field that maintains full simplex support. It enables guidance (classifier and classifier-free) and distillation for fast, one-step generation, demonstrated on complex DNA sequence design tasks including promoters and enhancers with favorable distributional metrics. The approach shows superior performance over baselines and offers practical speedups, making it a versatile framework for principled discrete sequence generation beyond DNA design.

Abstract

Discrete diffusion or flow models could enable faster and more controllable sequence generation than autoregressive models. We show that naïve linear flow matching on the simplex is insufficient toward this goal since it suffers from discontinuities in the training target and further pathologies. To overcome this, we develop Dirichlet flow matching on the simplex based on mixtures of Dirichlet distributions as probability paths. In this framework, we derive a connection between the mixtures' scores and the flow's vector field that allows for classifier and classifier-free guidance. Further, we provide distilled Dirichlet flow matching, which enables one-step sequence generation with minimal performance hits, resulting in $O(L)$ speedups compared to autoregressive models. On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. Finally, we show that our classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets. Code is available at https://github.com/HannesStark/dirichlet-flow-matching.
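
The Dirichlet probability paths mentioned in the abstract can be illustrated with a minimal sketch. It assumes the conditional path toward vertex $\mathbf{e}_i$ is $\mathrm{Dir}(\mathbf{1} + t \cdot \mathbf{e}_i)$ with $t \geq 0$; this parameterization and all function names below are illustrative assumptions, not taken from the authors' code. Under that assumption, $t = 0$ gives uniform noise on the simplex and the expected mass on the target coordinate is $(1 + t)/(K + t)$, which approaches 1 as $t$ grows.

```python
import random


def sample_dirichlet(alpha, rng):
    """Sample one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_conditional_path(target, t, num_classes, rng):
    """Sample from the ASSUMED conditional path Dir(1 + t * e_target)."""
    alpha = [1.0] * num_classes
    alpha[target] += t
    return sample_dirichlet(alpha, rng)


def mean_target_mass(target, t, num_classes, num_samples=20000, seed=0):
    """Monte Carlo estimate of the expected mass on the target coordinate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += sample_conditional_path(target, t, num_classes, rng)[target]
    return total / num_samples


if __name__ == "__main__":
    K = 4  # e.g. the four DNA bases
    for t in (0.0, 1.0, 8.0):
        estimate = mean_target_mass(2, t, K)
        closed_form = (1.0 + t) / (K + t)
        print(f"t={t}: estimate={estimate:.3f}, closed form={closed_form:.3f}")
```

Because every concentration parameter stays at or above 1, samples from this family keep support over the whole simplex at any finite $t$, in contrast to a linear interpolation toward the vertex.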

Paper Structure

This paper contains 26 sections, 3 theorems, 40 equations, 11 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

Suppose that a flow matching model is trained with the linear flow map (Equation eq:linear-flow-map). Then, for all $k = 2, \ldots, K$ and $\mathbf{x} \sim p_t(\mathbf{x})$, the converged model posterior $p(\mathbf{x}_1 \mid \mathbf{x}) \propto p_t(\mathbf{x} \mid \mathbf{x}_1)p_\text{data}(\mathbf{x

Figures (11)

  • Figure 1: Overview of Dirichlet flow matching. We represent a sequence of discrete variables as a sequence of simplices. Here, we only show the simplex of one of the sequence positions. Left: starting from uniform noise on the probability simplex, we define conditional probability paths that approach a point mass at the vertex via a one-parameter family of Dirichlet distributions. We view a sequence of tokens as a sequence of simplices for which the probability path corresponds to noising tokens via superposition with all other possible tokens (during inference, the simplices depend on each other through a joint denoiser). Right: Comparison of the marginal probability paths and vector fields of Dirichlet and linear FM. The vector fields of Dirichlet FM are smooth in time and space, unlike linear FM.
  • Figure 2: Pathological behavior of linear flow matching on the simplex with $K=4$ (top) and $K=3$ (bottom). Each color represents a conditional probability path evolving over time toward its target vertex. At $t=1/4$, $t=1/3$, and $t=1/2$, the region of overlap between 4, 3, and 2 conditional probability paths disappears, respectively, corresponding to a shrinking set of possible values of $\mathbf{x}_1 \mid \mathbf{x}$ for any $\mathbf{x}$.
  • Figure 3: Vector field magnitudes of the conditional flow field $u_t(\mathbf{x} \mid \mathbf{x}_1 = \mathbf{e}_i)$ as a function of $x_i$ (the $i$th element of $\mathbf{x}$) plotted for varying values of $t$. In Dirichlet FM, the field vanishes at both $x_i=1$ (i.e., the target vertex) and $x_i=0$ (the opposite face).
  • Figure 4: Scaling to higher simplex dimensions. We train on simple categorical distributions with an increasing number of categories $K$ and measure the KL divergence between the generated distributions (512k samples) and the training target distribution. Dirichlet FM scales to larger $K$ much better than linear FM.
  • Figure 5: Classifier-free guidance for cell type conditional enhancer design. We generate enhancers that are only active in a target cell class via classifier-free guidance with varying $\gamma$. Shown are 4 classes of the Fly Brain cell data. The left y-axis (FBD) is computed between the generated sequences and the data distribution conditioned on the target class. For the first class, "PNG", functional sequences from Taskiran et al. (2023) are available, and we show their FBD. The right y-axis (Prob.) is the target class probability, in percent, that a classifier assigns to the generated sequences.
  • ...and 6 more figures
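
The shrinking overlap described in the Figure 2 caption can be checked numerically. The sketch below assumes the linear flow map has the standard form $\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\mathbf{e}_i$ with $\mathbf{x}_0$ uniform on the simplex; that form and all function names are assumptions made for illustration. Inverting the map gives $\mathbf{x}_0 = (\mathbf{x} - t\,\mathbf{e}_i)/(1 - t)$, which lies on the simplex exactly when $x_i \geq t$, so a point is compatible with vertex $i$ only if its $i$th coordinate is at least $t$. Since the coordinates sum to 1, fewer than $k$ vertices can be compatible once $t > 1/k$.

```python
import random


def sample_uniform_simplex(num_classes, rng):
    """Uniform sample on the simplex via normalized exponential draws."""
    draws = [rng.expovariate(1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def linear_path_point(x0, target, t):
    """ASSUMED linear flow map: x_t = (1 - t) * x_0 + t * e_target."""
    return [(1.0 - t) * v + (t if i == target else 0.0) for i, v in enumerate(x0)]


def compatible_vertices(x, t, tol=1e-12):
    """Vertices i whose conditional path could have produced x at time t.

    For t < 1 this is exactly the set of coordinates with x_i >= t.
    """
    return [i for i, v in enumerate(x) if v >= t - tol]


def max_compatible(num_classes, t, num_samples=5000, seed=0):
    """Largest number of compatible vertices seen over sampled points x_t."""
    rng = random.Random(seed)
    best = 0
    for _ in range(num_samples):
        target = rng.randrange(num_classes)
        x0 = sample_uniform_simplex(num_classes, rng)
        x = linear_path_point(x0, target, t)
        best = max(best, len(compatible_vertices(x, t)))
    return best


if __name__ == "__main__":
    for t in (0.2, 0.3, 0.4, 0.6):
        print(f"K=4, t={t}: at most {max_compatible(4, t)} compatible vertices observed")
```

For $K = 4$, no sampled point can be compatible with all four vertices once $t > 1/4$, and for $t > 1/2$ every point is compatible with exactly one vertex, so the posterior over $\mathbf{x}_1$ collapses before the flow reaches $t = 1$.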

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 1
  • proof