Enhancing DNA Foundation Models to Address Masking Inefficiencies

Monireh Safari, Pablo Millan Arias, Scott C. Lowe, Lila Kari, Angel X. Chang, Graham W. Taylor

TL;DR

DNA foundation models pretrained with masked language modeling suffer from distribution shift when [MASK] tokens are absent at inference time. BarcodeMAE addresses this by adopting an encoder-decoder masked autoencoder where the encoder omits [MASK] tokens and the decoder reconstructs them, aligning pretraining with downstream inference. Pretrained on BIOSCAN-5M, BarcodeMAE achieves strong genus-level classification gains and a robust harmonic mean across closed- and open-world tasks, illustrating the benefit of domain-specific architectural choices over conventional encoder-only MLM approaches. The work highlights the importance of matching training objectives to real-use scenarios in genomic sequence analysis and provides a practical, scalable approach for DNA barcode representation learning.
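As a brief illustration of why a harmonic mean is a reasonable summary across closed- and open-world tasks: it stays low unless both scores are high, so a model cannot score well by excelling at only one setting. The sketch below is a generic harmonic mean of two scores; the function name and the example values are hypothetical and are not results from the paper, and this summary does not state exactly which metrics the authors combine.

```python
def harmonic_mean(closed_world_score: float, open_world_score: float) -> float:
    """Harmonic mean of two non-negative scores (returns 0.0 if either is 0)."""
    if closed_world_score < 0 or open_world_score < 0:
        raise ValueError("scores must be non-negative")
    total = closed_world_score + open_world_score
    if total == 0:
        return 0.0
    return 2.0 * closed_world_score * open_world_score / total


if __name__ == "__main__":
    # Hypothetical scores, chosen only to show the effect of imbalance.
    print(harmonic_mean(0.8, 0.8))  # balanced: equals the common value
    print(harmonic_mean(0.8, 0.4))  # imbalanced: pulled toward the lower score
```

The arithmetic mean of 0.8 and 0.4 would be 0.6, whereas the harmonic mean is closer to the weaker score, which is the property that makes it useful for judging balanced performance.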

Abstract

Masked language modelling (MLM) as a pretraining objective has been widely adopted in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference detrimentally impacts performance, as the pretraining task is to map [MASK] tokens to predictions, yet the [MASK] is absent during downstream applications. This means the encoder does not prioritize its encodings of non-[MASK] tokens, and expends parameters and compute on work only relevant to the MLM task, despite this being irrelevant at deployment time. In this work, we propose a modified encoder-decoder architecture based on the masked autoencoder framework, designed to address this inefficiency within a BERT-based transformer. We empirically show that the resulting mismatch is particularly detrimental in genomic pipelines where models are often used for feature extraction without fine-tuning. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes. We achieve substantial performance gains in both closed-world and open-world classification tasks when compared against causal models and bidirectional architectures pretrained with MLM tasks.
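The difference between the two input schemes described above can be sketched at the token level. This is a minimal illustration, not the authors' implementation: the `MASK_ID` value, the 50% mask ratio, the function names, and the toy token ids are all assumptions, and the BERT-style variant is simplified to always substitute the placeholder. In a conventional MLM setup the encoder sees [MASK] placeholders; in the masked-autoencoder setup the encoder sees only the visible tokens (with their original positions), and the masked positions are left for the decoder to reconstruct.

```python
import random
from typing import List, Sequence, Tuple

MASK_ID = -1  # hypothetical placeholder id for the [MASK] token


def choose_masked_positions(n_tokens: int, mask_ratio: float, seed: int = 0) -> List[int]:
    """Pick a random subset of positions to mask (sorted for readability)."""
    rng = random.Random(seed)
    n_masked = round(n_tokens * mask_ratio)
    return sorted(rng.sample(range(n_tokens), n_masked))


def mlm_encoder_input(token_ids: Sequence[int], masked_pos: Sequence[int]) -> List[int]:
    """Simplified BERT-style input: masked positions are replaced by MASK_ID."""
    masked = set(masked_pos)
    return [MASK_ID if i in masked else t for i, t in enumerate(token_ids)]


def mae_encoder_input(
    token_ids: Sequence[int], masked_pos: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """MAE-style input: masked tokens are dropped, so the encoder never sees MASK_ID.

    Returns the visible tokens and their original positions, which a decoder
    would need in order to place predictions for the missing positions.
    """
    masked = set(masked_pos)
    visible_pos = [i for i in range(len(token_ids)) if i not in masked]
    return [token_ids[i] for i in visible_pos], visible_pos


if __name__ == "__main__":
    tokens = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]  # toy k-mer token ids
    masked_pos = choose_masked_positions(len(tokens), mask_ratio=0.5)
    print("MLM encoder input:", mlm_encoder_input(tokens, masked_pos))
    print("MAE encoder input:", mae_encoder_input(tokens, masked_pos))
```

At inference time neither scheme receives [MASK] tokens, so only the MAE-style encoder is used on the same kind of input it saw during pretraining.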

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of pretraining processes for BarcodeBERT (left) and BarcodeMAE (right). BarcodeBERT uses an encoder-only transformer architecture with direct masking. BarcodeMAE processes DNA barcode sequences through a transformer encoder-decoder architecture. The masking strategy differs from other foundation models by excluding the [MASK] token from the encoder input, requiring the decoder to predict masked sequences. After pretraining, the decoder is discarded and only the encoder is used for downstream tasks.
  • Figure 2: t-SNE visualization of DNA barcode embeddings from BarcodeBERT (left) and BarcodeMAE (right) for 20 randomly selected underrepresented genera. Each point represents a DNA barcode sequence, and colours indicate different genera. BarcodeMAE shows more distinct and well-separated clusters, suggesting better discrimination between genera compared to BarcodeBERT.
  • Figure 3: Impact of masking and token deletion on genus-level classification accuracy. While BarcodeBERT remains stable at higher drop rates, the practical inference scenario is $x\!=\!0$ (no masking), where BarcodeMAE demonstrates superior performance. BarcodeBERT's robustness to masked or removed tokens does not translate into better real-world performance, since these conditions are not encountered during inference.