D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation

Zhao Yang, Hengchang Liu, Chuan Cao, Bing Su

TL;DR

D3LM unifies bidirectional representation learning and DNA generation through masked diffusion in discrete DNA space, so a single model supports both understanding and generation.

Abstract

Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at https://huggingface.co/collections/Hengchang-Liu/d3lm.

Paper Structure (26 sections, 16 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Comparison of DNA modeling approaches. (a) Enhancers regulate promoters from either upstream or downstream positions, demonstrating bidirectional regulatory relationships in DNA. (b) BERT-style models use bidirectional attention but employ fixed masking ratios and lack generative capabilities. (c) Autoregressive models generate sequences left-to-right but cannot adjust earlier positions once generated, making it difficult to satisfy global constraints. (d) D3LM combines bidirectional modeling with generation through masked diffusion with variable masking ratios, enabling iterative refinement of all positions simultaneously.
  • Figure 2: D3LM framework overview. (a) Training with variable masking ratios sampled from a uniform distribution. The model learns to predict masked tokens using bidirectional attention. (b) Iterative generation starting from fully masked sequences, progressively unmasking tokens through repeated sampling. (c) Fine-tuning on downstream genomic tasks using frozen or trainable encoder with task-specific heads for promoter classification, histone modification, and splice site prediction.
  • Figure 3: Analysis of Model Properties. (a) Illustration of the scaling law behavior. (b) Visualization of the tokenization process.
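
The training and generation loops described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The single-nucleotide vocabulary, the `[MASK]` string, the function names, the random `toy_predictor` standing in for the trained network, and the confidence-ranked unmasking schedule are all assumptions made for illustration; the paper studies its own tokenization schemes and sampling strategies.

```python
import random

MASK = "[MASK]"                 # hypothetical mask symbol
VOCAB = ["A", "C", "G", "T"]    # assumption: one token per nucleotide


def mask_sequence(tokens, rng):
    """Noising step: draw a masking ratio t ~ U(0, 1), then mask each
    token independently with probability t."""
    t = rng.random()
    noisy = [MASK if rng.random() < t else tok for tok in tokens]
    return noisy, t


def toy_predictor(tokens, rng):
    """Hypothetical stand-in for the trained bidirectional model:
    returns a (token, confidence) guess for every position."""
    return [(rng.choice(VOCAB), rng.random()) for _ in tokens]


def generate(length, steps, rng, predictor=toy_predictor):
    """Start from a fully masked sequence and reveal tokens over
    `steps` rounds (assumes steps >= 1)."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = predictor(seq, rng)
        # Spread the still-masked positions evenly over the remaining
        # rounds (ceiling division), so the last round reveals the rest.
        k = -(-len(masked) // (steps - step))
        # Assumed strategy: reveal the most confident predictions first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            seq[i] = preds[i][0]
    return seq


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, t = mask_sequence(list("ACGTACGT"), rng)
    print(f"t={t:.2f}", noisy)
    print("".join(generate(length=12, steps=4, rng=rng)))
```

In a real setup, `toy_predictor` would be replaced by the model's per-position token distribution, and the masked positions produced by `mask_sequence` would be the ones contributing to the training loss.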