DiscDiff: Latent Diffusion Model for DNA Sequence Generation

Zehui Li, Yuhao Ni, William A V Beardall, Guoxuan Xia, Akashaditya Das, Guy-Bart Stan, Yiren Zhao

TL;DR

This work tackles the challenge of realistic DNA sequence generation by introducing DiscDiff, a latent diffusion framework tailored for discrete DNA data, paired with Absorb-Escape, a post-processing refinement that mitigates latent-to-input rounding errors. It presents EPD-GenDNA, the first large-scale, multi-species DNA dataset to benchmark unconditional and conditional generation across 15 species. Empirically, DiscDiff outperforms existing diffusion baselines on short and long sequences, and Absorb-Escape provides additional gains and controllability over motif distributions. The study demonstrates cross-species conditional generation and highlights potential practical impacts for gene therapy and protein production, supported by a thorough ablation and motif-focused evaluation.

Abstract

This paper introduces a novel framework for DNA sequence generation, comprising two key components: DiscDiff, a Latent Diffusion Model (LDM) tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences. Absorb-Escape enhances the realism of the generated sequences by correcting the "rounding errors" inherent in the conversion between latent and input spaces. Our approach outperforms existing diffusion models in generating both short and long DNA sequences. Additionally, we introduce EPD-GenDNA, the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species. We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.
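
To make the idea of a rounding error concrete, the following is a minimal sketch, not the authors' Absorb-Escape implementation. It assumes a decoder that emits a per-position probability distribution over nucleotides, rounds it to a discrete sequence by argmax, and then overrides low-confidence positions using a second model. The names `round_to_sequence`, `refine`, `toy_prior`, the threshold value, and the example probabilities are all hypothetical.

```python
NUCLEOTIDES = "ACGT"  # column order assumed for every probability row


def round_to_sequence(probs):
    """Argmax-round each position; return the sequence and per-position confidence."""
    seq, conf = [], []
    for row in probs:
        best = max(range(len(NUCLEOTIDES)), key=lambda k: row[k])
        seq.append(NUCLEOTIDES[best])
        conf.append(row[best])
    return "".join(seq), conf


def refine(probs, second_model, threshold=0.5):
    """Replace positions rounded with confidence below `threshold`.

    `second_model(prefix)` is a hypothetical callable returning a distribution
    over NUCLEOTIDES for the next position, given the sequence so far.
    """
    seq, conf = round_to_sequence(probs)
    out = list(seq)
    for i, c in enumerate(conf):
        if c < threshold:
            alt = second_model("".join(out[:i]))
            out[i] = NUCLEOTIDES[max(range(len(NUCLEOTIDES)), key=lambda k: alt[k])]
    return "".join(out)


def toy_prior(prefix):
    """Stand-in for a sequence model: strongly expects 'A' after 'TAT'."""
    if prefix.endswith("TAT"):
        return [0.85, 0.05, 0.05, 0.05]
    return [0.25, 0.25, 0.25, 0.25]


# Made-up decoder output: the last position is an uncertain 'T' (0.4).
probs = [
    [0.1, 0.1, 0.1, 0.7],  # T
    [0.7, 0.1, 0.1, 0.1],  # A
    [0.1, 0.1, 0.1, 0.7],  # T
    [0.3, 0.2, 0.1, 0.4],  # T, but with low confidence
]

rounded, confidence = round_to_sequence(probs)
refined = refine(probs, toy_prior)
```

With these made-up numbers, plain rounding gives `TATT`, and the low-confidence final position is changed so that the refined sequence is `TATA`. The actual algorithm in the paper may select and rewrite positions differently.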

Paper Structure (44 sections, 3 equations, 24 figures, 14 tables)

Figures (24)

  • Figure 1: A comparison of motif frequency distributions. The graphs contrast the occurrences of TATA-Box and Initiator motifs at each position in a set of samples from natural DNA against those generated by various models. A closer match in frequency distributions suggests higher realism and better performance for the generated DNA sequences. Qualitatively, DiscDiff and Absorb-Escape outperform existing models by a significant margin.
  • Figure 2: Generation Task with EPD-GenDNA. (a) Dataset: The EPD-GenDNA dataset includes 160K unique sequences from 15 species and 30 million samples with associated metadata. (b) Generative Modelling: A probabilistic model $p_{\theta}(s)$ is trained to generate new DNA sequences. (c) Model Evaluation: Generated sequences are evaluated through measuring latent distances and analyzing motif distributions.
  • Figure 3: DiscDiff Model: A two-step process for DNA sequence generation. Step 1: VAE Training: A sequence $s \in \{A, T, G, C\}^{2048/256}$ is encoded by a 1D-Encoder followed by a 2D-Encoder. The latent space representation $Z$ with parameters $\mu, \epsilon, \sigma$ is then decoded back to $\tilde{s}$ through a 2D-Decoder and 1D-Decoder. Step 2: Denoising Network Training: The latent representation $Z$ is processed through a denoising network comprising a ResNet Block, optional Self-Attention, and Cross Attention, with species and time information. The network outputs a Gaussian distribution $N(z; \mu, \Sigma)$. A U-Net architecture takes this distribution to produce various $z_0$ representations, which a Locked Decoder (frozen parameters) uses to generate the final DNA sequences.
  • Figure 4: The Absorb-Escape Algorithm: Enhancing DNA Sequence Prediction. While diffusion models (DMs) effectively capture broad DNA sequence features, they can err at the single-nucleotide level. The Absorb-Escape algorithm corrects such errors by identifying and modifying low-probability nucleotides, like changing 'TATT' to 'TATA'. This improves accuracy over using only DMs or autoregressive models, as shown in Figure 1.
  • Figure 5: Model Comparison of Chicken DNA Motif Distributions. This illustrates the Initiator and GC box frequencies across natural and generated DNA sequences near the TSS.
  • ...and 19 more figures
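
The position-wise motif comparison described in the captions for Figures 1 and 5 can be illustrated with a small sketch. This is an assumption-laden example rather than the paper's evaluation code: it matches the literal string `TATAAA` as a stand-in for the TATA-box (real analyses typically scan with position weight matrices), the sequences are made up, and the L1 distance is only one possible way to summarise the gap between two frequency profiles.

```python
from collections import Counter


def motif_position_counts(sequences, motif):
    """Count, for each start position, how many times `motif` begins there."""
    counts = Counter()
    for seq in sequences:
        start = seq.find(motif)
        while start != -1:
            counts[start] += 1
            start = seq.find(motif, start + 1)  # step by 1 to allow overlaps
    return counts


def motif_position_frequency(sequences, motif, length):
    """Fraction of sequences with `motif` starting at each valid position."""
    counts = motif_position_counts(sequences, motif)
    n = len(sequences)
    return [counts[p] / n for p in range(length - len(motif) + 1)]


def l1_distance(f, g):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(f, g))


# Hypothetical toy data: four length-10 sequences per set.
natural = ["GGTATAAAGG", "CCTATAAACC", "GGGGTATAAA", "ACGTACGTAC"]
generated = ["GGTATAAAGG", "TATAAACCCC", "ACGTACGTAC", "ACGTACGTAC"]

freq_natural = motif_position_frequency(natural, "TATAAA", 10)
freq_generated = motif_position_frequency(generated, "TATAAA", 10)
gap = l1_distance(freq_natural, freq_generated)
```

Here `freq_natural` is `[0.0, 0.0, 0.5, 0.0, 0.25]`, `freq_generated` is `[0.25, 0.0, 0.25, 0.0, 0.0]`, and `gap` is `0.75`. A smaller gap would indicate that the generated set places the motif at positions more like those in the natural set.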