Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision

Aditya Malusare; Harish Kothandaraman; Dipesh Tamboli; Nadia A. Lanman; Vaneet Aggarwal

TL;DR

ENBED is a byte-level encoder-decoder Transformer for DNA that supports sequence-to-sequence analyses by combining cross-attention with sub-quadratic attention. Pretrained with Masked Language Modeling on telomere-to-telomere and other reference genomes, the model is fine-tuned for tasks including noise detection, biological function annotation, and mutation generation in influenza, and it outperforms prior state-of-the-art results on multiple benchmarks. The results show performance gains across diverse genomic datasets, high sensitivity to sequencing noise, and generation of plausible mutations, which points to the approach's potential for variant effect prediction and genome interpretation. The work also emphasizes the importance of high-quality reference data and encoder-decoder architectures for modeling complex genomic transformations.

Abstract

This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to obtain an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models that use encoder-only or decoder-only architectures. We pre-train the foundation model with Masked Language Modeling on reference genome sequences and apply it to the following downstream tasks: (1) identification of enhancers, promoters, and splice sites; (2) recognition of sequences containing base-call mismatches and insertion/deletion errors, where byte-level tokenization has an advantage over tokenization schemes that group multiple base pairs and therefore lose single-base resolution; (3) identification of biological function annotations of genomic sequences; and (4) generation of mutations of the influenza virus using the encoder-decoder architecture, validated against real-world observations. In each of these tasks, we demonstrate significant improvement over existing state-of-the-art results.
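
The resolution argument in task (2) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: byte-level tokens are modeled as raw UTF-8 byte values (ENBED's actual vocabulary and special tokens are not described here), and the multi-base baseline is a hypothetical non-overlapping 3-mer tokenizer rather than any specific model's scheme. The helper names are illustrative only. The sketch measures how far a single-base insertion moves the token sequence under each scheme, using token-level edit distance.

```python
def byte_tokenize(seq: str) -> list[int]:
    """One token per nucleotide: the UTF-8 byte value of each character (assumed scheme)."""
    return list(seq.encode("utf-8"))


def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Hypothetical baseline: non-overlapping k-mers, keeping any shorter trailing fragment."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


reference = "ACGTACGTACGT"
# Simulated insertion error: one extra G after the sixth base.
inserted = reference[:6] + "G" + reference[6:]

byte_shift = edit_distance(byte_tokenize(reference), byte_tokenize(inserted))
kmer_shift = edit_distance(kmer_tokenize(reference), kmer_tokenize(inserted))

print("byte-level token edits:", byte_shift)
print("3-mer token edits:", kmer_shift)
```

For this toy sequence the byte-level token sequence differs by a single inserted token, while the 3-mer sequence differs by three token edits, because the insertion shifts the reading frame of every k-mer downstream of it. The example says nothing about model accuracy; it only shows why a one-token-per-base scheme keeps insertion/deletion errors local in token space.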

Paper Structure

This paper contains 48 sections, 1 equation, 3 figures, and 12 tables.

Introduction
Limitations of previous work
Architecture.
Tokenization.
Our contributions
Evaluation of performance on genomic benchmark datasets.
Identifying sequencing noise.
Biological function annotations.
Studying mutations as a sequence-to-sequence process.
Methods
Encoder-Decoder Model Architecture
Tokenization
Attention
Sliding-window attention.
Global attention.
...and 33 more sections

Figures (3)

Figure 1: Model Architecture. The model is constructed using encoder and decoder blocks with a ratio of 2:1. Both types of blocks consist of attention and feed-forward layers, with the decoder blocks additionally incorporating the embeddings in encoder-decoder attention layers.
Figure 2: Interpreting Attention Layers. We visualize the twelve attention heads of the pre-trained ENBED foundation model.
Figure 3: Phylogenetic Tree.
