Enhancing DNA Foundation Models to Address Masking Inefficiencies
Monireh Safari, Pablo Millan Arias, Scott C. Lowe, Lila Kari, Angel X. Chang, Graham W. Taylor
TL;DR
DNA foundation models pretrained with masked language modeling (MLM) suffer from distribution shift because the [MASK] tokens seen during pretraining are absent at inference time. BarcodeMAE addresses this with an encoder-decoder masked autoencoder: the encoder's input omits [MASK] tokens and a decoder reconstructs the masked positions, so that pretraining matches downstream inference. Pretrained on BIOSCAN-5M, BarcodeMAE achieves strong genus-level classification gains and a robust harmonic mean across closed- and open-world tasks, illustrating the benefit of domain-specific architectural choices over conventional encoder-only MLM approaches. The work highlights the importance of matching the training objective to real-use scenarios in genomic sequence analysis and offers a practical, scalable approach to DNA barcode representation learning.
Abstract
Masked language modelling (MLM) has been widely adopted as a pretraining objective in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference degrades performance: the pretraining task is to map [MASK] tokens to predictions, yet [MASK] tokens are absent in downstream applications. As a result, the encoder does not prioritize its encodings of non-[MASK] tokens, and it expends parameters and compute on work that is relevant only to the MLM task and not at deployment time. We empirically show that this mismatch is particularly detrimental in genomic pipelines, where models are often used for feature extraction without fine-tuning. In this work, we propose a modified encoder-decoder architecture based on the masked autoencoder framework, designed to address this inefficiency within a BERT-based transformer. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes, and achieve substantial performance gains in both closed-world and open-world classification tasks compared with causal models and bidirectional architectures pretrained with MLM.
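To make the contrast concrete, the following is a minimal sketch of how the inputs differ between encoder-only MLM and a masked-autoencoder setup. It is not the authors' implementation. The non-overlapping 4-mer tokens, the 50% mask ratio, and all function names are assumptions chosen for illustration, and plain tokens stand in for the embeddings a real encoder would produce.

```python
import random

MASK = "[MASK]"

def choose_masked(num_tokens, mask_ratio=0.5, seed=0):
    """Randomly split positions into visible and masked (hypothetical helper)."""
    rng = random.Random(seed)
    n_mask = min(num_tokens, max(1, round(num_tokens * mask_ratio)))
    masked = sorted(rng.sample(range(num_tokens), n_mask))
    masked_set = set(masked)
    visible = [i for i in range(num_tokens) if i not in masked_set]
    return visible, masked

def mlm_encoder_input(tokens, masked):
    """Encoder-only MLM: the encoder itself receives [MASK] placeholders."""
    masked_set = set(masked)
    return [MASK if i in masked_set else tok for i, tok in enumerate(tokens)]

def mae_encoder_input(tokens, visible):
    """MAE-style: the encoder receives only visible tokens, tagged with position."""
    return [(i, tokens[i]) for i in visible]

def mae_decoder_input(encoded_visible, masked, length):
    """MAE-style: [MASK] placeholders are reinserted only for the decoder."""
    slots = [None] * length
    for i, item in encoded_visible:
        slots[i] = item  # in a real model this would be an encoder output vector
    for i in masked:
        slots[i] = MASK
    return slots

# Illustrative barcode fragment split into assumed non-overlapping 4-mers.
tokens = ["AACA", "TTAT", "ATTT", "TATT", "TTTG", "GAAT", "TTGA", "GCAG"]
visible, masked = choose_masked(len(tokens))

enc_mlm = mlm_encoder_input(tokens, masked)
enc_mae = mae_encoder_input(tokens, visible)
dec_mae = mae_decoder_input(enc_mae, masked, len(tokens))

print("MLM encoder input:", enc_mlm)
print("MAE encoder input:", enc_mae)
print("MAE decoder input:", dec_mae)
```

In this sketch the MAE-style encoder never receives a [MASK] placeholder, so its input at pretraining time has the same form as an unmasked sequence at inference time. The placeholders appear only in the decoder input, and the decoder can be set aside when the encoder is used for feature extraction.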
