Toward Understanding BERT-Like Pre-Training for DNA Foundation Models

Chaoqi Liang, Lifeng Qiao, Peng Ye, Nanqing Dong, Jianle Sun, Weiqiang Bai, Yuchen Ren, Xinzhu Ma, Hongliang Yan, Chunfeng Song, Wanli Ouyang, Wangmeng Zuo

TL;DR

This work analyzes how BERT-like pre-training should be tailored for DNA sequences, revealing that overlapping K-mer tokenization helps fine-tuning but causes rapid convergence during pre-training, potentially under-training important layers. It introduces RandomMask, a curriculum-like masking strategy that gradually expands the masked region to enforce learning of region-level information. Across six downstream tasks, RandomMask with overlapping 6-mer tokenization achieves state-of-the-art performance, notably 68.16% MCC on epigenetic mark prediction and strong gains on transcription factor and promoter-related tasks. The findings highlight the need to align tokenizer design and pre-training objectives with DNA's nucleotide- and region-level structure, enabling more robust DNA foundation models with practical impact on genomic analysis tasks.
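
The difference between the two tokenizers can be made concrete with a short sketch. It is an illustration only, not the authors' code: the function names `kmer_tokenize` and `recover_masked_token` and the example sequence are assumptions introduced here. It uses 3-mers for brevity, whereas the reported results use 6-mers. The second function shows one plausible reason an overlapping tokenizer makes masked-token prediction easy: a single masked token can be rebuilt exactly from its two neighbours.

```python
from typing import List


def kmer_tokenize(seq: str, k: int, overlapping: bool = True) -> List[str]:
    """Split a DNA sequence into k-mers (hypothetical helper).

    overlapping=True  -> stride 1, adjacent tokens share k-1 nucleotides
    overlapping=False -> stride k, tokens share no nucleotides; trailing
                         nucleotides that do not fill a k-mer are dropped
    """
    stride = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def recover_masked_token(tokens: List[str], i: int) -> str:
    """Rebuild overlapping token i from its unmasked neighbours (needs k >= 2).

    The left neighbour supplies the first k-1 nucleotides; the right
    neighbour supplies the last one.
    """
    left, right = tokens[i - 1], tokens[i + 1]
    return left[1:] + right[-2]


seq = "ATGCGTAC"  # made-up example sequence
overlap = kmer_tokenize(seq, 3, overlapping=True)
non_overlap = kmer_tokenize(seq, 3, overlapping=False)

print(overlap)      # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(non_overlap)  # ['ATG', 'CGT']

# Masking a single overlapping token leaks its full content via the neighbours.
print(recover_masked_token(overlap, 2))  # GCG
```

Because isolated masked tokens are trivially recoverable in the overlapping case, masking longer contiguous regions is one way to make the prediction task harder, which is consistent with the motivation given for RandomMask.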

Abstract

With the success of large-scale pre-training in language tasks, there is an increasing trend of applying it to the life sciences. In particular, pre-training methods based on DNA sequences have received growing attention because of their potential to capture general information about genes. However, existing pre-training methods for DNA sequences largely rely on direct adoption of BERT pre-training from NLP and lack both a comprehensive understanding and a specifically tailored approach. To address this research gap, we provide the first empirical study, which yields three observations. Based on this study, we find that an overlapping tokenizer can benefit the fine-tuning of downstream tasks but leads to inadequate pre-training with fast convergence. To unleash the pre-training potential, we introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pre-training by continuously expanding its mask boundary, forcing the model to learn more knowledge. RandomMask is simple but effective, achieving state-of-the-art performance across 6 downstream tasks. On Epigenetic Mark Prediction, RandomMask achieves a Matthews correlation coefficient of 68.16%, an increase of 19.85% over the baseline and a 3.69% improvement over the previous state-of-the-art result.
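
For readers unfamiliar with the headline metric, the Matthews correlation coefficient (MCC) for binary classification can be computed from confusion-matrix counts as sketched below. The function name `mcc_from_counts` and the example counts are assumptions for illustration, not values from the paper. A figure such as 68.16% presumably corresponds to the coefficient multiplied by 100.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when the denominator is zero, a
    common convention for degenerate confusion matrices.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Made-up counts, purely for illustration.
score = mcc_from_counts(tp=45, tn=40, fp=10, fn=5)
print(f"MCC = {score:.4f} ({100 * score:.2f}%)")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are imbalanced.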

Paper Structure (22 sections, 1 equation, 24 figures, 8 tables, 1 algorithm)

Figures (24)

  • Figure 1: MLMs for NLP.
  • Figure 2: MLMs for DNA with a non-overlapping 3-mer tokenizer.
  • Figure 3: MLMs for DNA with an overlapping 3-mer tokenizer.
  • Figure 5: Illustration of region-level and nucleotide-resolution information in DNA sequence modeling. DNA modeling must capture information at two distinct levels. At the region level, functional elements such as promoters and enhancers span tens to hundreds of nucleotides and act as integrated units to regulate gene expression. Capturing information at nucleotide resolution is also crucial, because a variation in a single nucleotide can significantly alter gene function.
  • Figure 6: Detailed t-SNE visualization of the embedding space learned by DNABERT with an overlapping tokenizer. Plots (a) and (c) show the clustering of marginal nucleotides, and plot (b) shows the clustering of the two central nucleotides. Plot (d) shows all 6-mer tokens in the embedding space. The two central nucleotides of a 6-mer token determine which cluster it falls into, and the marginal nucleotides determine its position within that cluster.
  • ...and 19 more figures