Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation
Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee
TL;DR
This work addresses the challenge of scalable, controllable discrete sequence generation by introducing Gumbel-Softmax Flow Matching (FM) and Gumbel-Softmax Score Matching (SM) on the simplex, both enabled by a time-varying Gumbel-Softmax interpolant. It derives a parameterized velocity field and a corresponding score model that transport noisy distributions in the interior of the simplex toward distributions concentrated at a vertex, while conserving probability mass on the simplex. To enable training-free guidance, it presents Straight-Through Guided Flows (STGFlow), which uses straight-through estimators with pre-trained classifiers to steer inference toward optimal sequences and can be applied to any discrete flow method. The approach achieves state-of-the-art or competitive results across conditional DNA promoter design, de novo protein sequence design, and target-binding peptide design, demonstrating scalable, diverse, and controllable generation for complex biological tasks.
Abstract
Flow matching on the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to the higher simplex dimensions required for peptide and protein generation. We introduce Gumbel-Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel-Softmax interpolant with a time-dependent temperature. Using this interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel-Softmax Score Matching, which learns to regress the gradient of the probability density. Our framework enables high-quality, diverse generation and scales efficiently to higher-dimensional simplices. To enable training-free guidance, we propose Straight-Through Guided Flows (STGFlow), a classifier-based guidance method that leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state-of-the-art performance in conditional DNA promoter design, sequence-only protein generation, and target-binding peptide design for rare disease treatment.
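As a rough illustration of the interpolant idea described above, the following Python sketch samples a point on the simplex from a Gumbel-Softmax distribution whose temperature decreases with time, so that samples move from smooth categorical distributions toward a single vertex. The exponential temperature schedule, the values of `tau_max` and `decay`, the `noise_scale` parameter, the form of the logits, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def temperature(t, tau_max=10.0, decay=3.0):
    """Assumed schedule: temperature decays exponentially as t goes from 0 to 1."""
    return tau_max * math.exp(-decay * t)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_interpolant(target_index, num_classes, t, noise_scale=1.0, rng=None):
    """Sample a point on the simplex at time t for a one-hot target token.

    High temperature (t near 0) gives a smooth, near-uniform distribution;
    low temperature (t near 1) concentrates mass near a single vertex.
    """
    rng = rng or random.Random()
    tau = temperature(t)
    logits = []
    for i in range(num_classes):
        clean = 1.0 if i == target_index else 0.0  # one-hot target entry
        logits.append((clean + noise_scale * gumbel_noise(rng)) / tau)
    return softmax(logits)


# Usage: a 4-letter DNA alphabet with the target token at index 2.
rng = random.Random(0)
for t in (0.0, 0.5, 1.0):
    x_t = gumbel_softmax_interpolant(2, 4, t, rng=rng)
    print(t, [round(p, 3) for p in x_t])
```

Every sample is a valid probability vector, so it stays on the simplex at all times. With `noise_scale=0.0` the sketch is deterministic, and the probability assigned to the target token increases as `t` grows, which mirrors the transport from smooth distributions toward a vertex.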
