Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation

Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee

TL;DR

This work addresses the challenge of scalable, controllable discrete sequence generation by introducing Gumbel-Softmax Flow Matching (FM) and Gumbel-Softmax Score Matching (SM) on the simplex, enabled by a time-varying Gumbel-Softmax interpolant. It derives a parameterized velocity field and a corresponding score model to transport from noisy interior distributions toward high-quality, vertex-concentrated sequences, while ensuring mass conservation on the simplex. To enable training-free guidance, it presents Straight-Through Guided Flows (STGFlow), which uses straight-through estimators with pre-trained classifiers to steer inference toward optimal sequences, applicable to any discrete flow method. The approach achieves state-of-the-art or competitive results across conditional DNA promoter design, de novo protein sequence design, and target-binding peptide design, demonstrating scalable, diverse, and controllable generation for complex biological tasks.

Abstract

Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel-Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel-Softmax interpolant with a time-dependent temperature. Using this interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel-Softmax Score Matching which learns to regress the gradient of the probability density. Our framework enables high-quality, diverse generation and scales efficiently to higher-dimensional simplices. To enable training-free guidance, we propose Straight-Through Guided Flows (STGFlow), a classifier-based guidance method that leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state-of-the-art performance in conditional DNA promoter design, sequence-only protein generation, and target-binding peptide design for rare disease treatment.
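As an illustration of the interpolant described above, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the schedule, so the exponential temperature decay, the values of `tau_max` and `decay`, and the function names are all assumptions made for this example. It shows how a lower temperature at later times moves a noisy point in the interior of the simplex toward a vertex.

```python
import numpy as np

def temperature(t, tau_max=10.0, decay=3.0):
    """Hypothetical schedule (an assumption): tau decays exponentially in t."""
    return tau_max * np.exp(-decay * t)

def gumbel_softmax_interpolant(x1_onehot, t, rng, tau_max=10.0, decay=3.0):
    """Map clean one-hot rows to noisy points on the simplex at time t."""
    # Standard Gumbel noise via the inverse CDF: g = -log(-log(u)).
    u = rng.uniform(low=1e-10, high=1.0, size=x1_onehot.shape)
    g = -np.log(-np.log(u))
    # Perturb the one-hot logits and scale by the time-dependent temperature.
    logits = (x1_onehot + g) / temperature(t, tau_max, decay)
    # Numerically stable softmax over the vocabulary axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy sequence of four tokens over a vocabulary of size 4 (e.g. DNA bases).
tokens = np.array([0, 3, 1, 2])
x1 = np.eye(4)[tokens]

# Reusing the seed gives both calls the same noise, so only the temperature differs.
x_early = gumbel_softmax_interpolant(x1, 0.0, np.random.default_rng(0))
x_late = gumbel_softmax_interpolant(x1, 1.0, np.random.default_rng(0))
```

With identical noise, each row of `x_late` is at least as peaked as the matching row of `x_early`, and every row remains a valid categorical distribution.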

Paper Structure

This paper contains 45 sections, 5 theorems, 75 equations, 9 figures, 7 tables, 5 algorithms.

Key Result

Proposition 1

The proposed conditional vector field and conditional probability path together satisfy the continuity equation and thus define a valid flow matching trajectory on the interior of the simplex.
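
For reference, the continuity equation can be written in its standard conditional form as below. The symbols are assumed here rather than taken from the paper: $p_t(\mathbf{x} \mid \mathbf{x}_1)$ denotes the conditional probability path and $u_t(\mathbf{x} \mid \mathbf{x}_1)$ the conditional vector field.

$$
\frac{\partial}{\partial t}\, p_t(\mathbf{x} \mid \mathbf{x}_1) + \nabla \cdot \big( p_t(\mathbf{x} \mid \mathbf{x}_1)\, u_t(\mathbf{x} \mid \mathbf{x}_1) \big) = 0
$$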

Figures (9)

  • Figure 1: Overview of Gumbel-Softmax Flow Matching. Gumbel-Softmax transformations with time-dependent temperatures are applied to clean one-hot sequences. The embedded noisy distributions are passed into a parameterized flow or score model and an error prediction model to predict the conditional flow velocity and score function.
  • Figure 2: Straight-Through Guided Flows (STGFlow). We compute the gradients of the classifier function with respect to $M$ discrete sequences sampled from the intermediate token distribution $\mathbf{x}_t$, which act as a guided flow velocity that steers the unconditional trajectory towards sequences with optimal scores.
  • Figure 3: Predicted structures of de novo generated proteins from Gumbel-Softmax FM. The structures, pLDDT, pAE, and pTM scores are predicted with ESMFold (Lin et al., 2023).
  • Figure 4: Gumbel-Softmax FM generated peptide binders for three targets with no known binders. (A) $10$ a.a. designed binder to JPH3 (structure generated with AlphaFold3) involved in Huntington’s Disease-Like 2. (B) $10$ a.a. designed binder to GFAP (PDB: 6A9P) involved in Alexander Disease. (C) $7$ a.a. designed binder to eIF2B (PDB: 6CAJ) involved in Vanishing White Matter Disease. Docked with AutoDock VINA and polar contacts within $3.5$ Å are annotated. Additional targets are shown in the table of peptides designed for targets with no existing binders.
  • Figure 5: Comparison of the existing and Gumbel-Softmax FM designed binders to protein 4EZN. The AutoDock VINA docking score of the designed binder ($-6.5$ kcal/mol; magenta) is lower than that of the existing binder ($-4.1$ kcal/mol; green), indicating stronger binding affinity. Polar contacts within $3.5$ Å are annotated. Additional comparisons of existing and designed binders are in the table of peptides designed for targets with existing binders.
  • ...and 4 more figures
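
The straight-through guidance idea in the Figure 2 caption can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' algorithm: the sigmoid scorer `toy_classifier_grad`, its `weights`, and the mean-subtraction step that makes each row of the guidance term sum to zero are all hypothetical choices made for this example. The sketch samples $M$ discrete sequences from $\mathbf{x}_t$, evaluates the classifier gradient at each one-hot sample, and uses the averaged gradient as if it had been taken with respect to $\mathbf{x}_t$.

```python
import numpy as np

def toy_classifier_grad(x_onehot, weights):
    """Gradient of a hypothetical scorer sigmoid(sum(weights * x)) w.r.t. x."""
    score = 1.0 / (1.0 + np.exp(-np.sum(weights * x_onehot)))
    return score * (1.0 - score) * weights

def straight_through_guidance(x_t, weights, num_samples, rng):
    """Average classifier gradients at discrete samples drawn from x_t."""
    length, vocab = x_t.shape
    grad = np.zeros_like(x_t)
    for _ in range(num_samples):
        # Sample one token per position from the relaxed distribution x_t.
        tokens = np.array([rng.choice(vocab, p=row) for row in x_t])
        sample = np.eye(vocab)[tokens]
        # Straight-through step: treat d(sample)/d(x_t) as the identity, so the
        # gradient at the discrete sample stands in for the gradient at x_t.
        grad += toy_classifier_grad(sample, weights)
    grad /= num_samples
    # Assumed projection: subtract the per-position mean so each row sums to
    # zero and a small step along it keeps the rows of x_t summing to one.
    return grad - grad.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x_t = np.full((5, 4), 0.25)          # uniform distribution at 5 positions
weights = rng.normal(size=(5, 4))    # hypothetical classifier parameters
guidance = straight_through_guidance(x_t, weights, num_samples=8, rng=rng)
```

Because each row of `guidance` sums to zero, adding a small multiple of it to `x_t` leaves the total probability mass at every position unchanged. A full sampler would also need to keep the entries non-negative.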

Theorems & Definitions (5)

  • Proposition 1: Continuity
  • Proposition 2: Probability Mass Conservation
  • Proposition 3: Valid Flow Matching Loss
  • Proposition 4
  • Proposition 5: Conservation of Probability Mass of Straight-Through Gradient