DiscDiff: Latent Diffusion Model for DNA Sequence Generation
Zehui Li, Yuhao Ni, William A V Beardall, Guoxuan Xia, Akashaditya Das, Guy-Bart Stan, Yiren Zhao
TL;DR
This work tackles the challenge of realistic DNA sequence generation by introducing DiscDiff, a latent diffusion framework tailored for discrete DNA data, paired with Absorb-Escape, a post-processing refinement that mitigates latent-to-input rounding errors. It presents EPD-GenDNA, the first large-scale, multi-species DNA dataset to benchmark unconditional and conditional generation across 15 species. Empirically, DiscDiff outperforms existing diffusion baselines on short and long sequences, and Absorb-Escape provides additional gains and control over motif distributions. The study demonstrates cross-species conditional generation and highlights potential practical impacts for gene therapy and protein production, supported by ablation studies and a motif-focused evaluation.
Abstract
This paper introduces a novel framework for DNA sequence generation, comprising two key components: DiscDiff, a Latent Diffusion Model (LDM) tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences. Absorb-Escape enhances the realism of the generated sequences by correcting the "rounding errors" inherent in the conversion between latent and input spaces. Our approach outperforms existing diffusion models in generating both short and long DNA sequences. Additionally, we introduce EPD-GenDNA, the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species. We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.
