DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, Han Liu

TL;DR

DNABERT-2 addresses inefficiencies in genome language modeling by replacing overlapping k-mer tokenization with a Byte Pair Encoding (BPE) tokenizer, which shortens inputs and improves sample efficiency. It also incorporates Attention with Linear Biases (ALiBi) to overcome input length constraints, Flash Attention for faster computation, and LoRA for parameter-efficient fine-tuning, resulting in a smaller but highly capable model. The authors standardize evaluation with the Genome Understanding Evaluation (GUE) benchmark, which combines 36 datasets across 9 tasks and multiple species, to enable fair comparisons. Empirically, DNABERT-2 achieves performance comparable to the state-of-the-art model with ~21x fewer parameters and ~92x less GPU time in pre-training, and it demonstrates strong performance on long-sequence tasks in GUE+ across diverse species.

Abstract

Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segment in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization. Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification dataset that amalgamates $36$ distinct datasets across $9$ tasks, with input lengths ranging from $70$ to $10000$. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with $21 \times$ fewer parameters and approximately $92 \times$ less GPU time in pre-training.
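
To make the BPE description above concrete, here is a minimal sketch of the merge procedure on a toy DNA corpus. It is not the authors' tokenizer: the function names (`train_bpe`, `merge_pair`, `tokenize`), the tie-breaking rule, and the two-sequence corpus are all assumptions chosen for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a list of DNA strings (toy version)."""
    # Start from single nucleotides.
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for toks in seqs:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # (an assumption made here only to keep the result deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        seqs = [merge_pair(toks, best) for toks in seqs]
    return merges


def tokenize(seq, merges):
    """Apply the learned merges, in order, to a new sequence."""
    toks = list(seq)
    for pair in merges:
        toks = merge_pair(toks, pair)
    return toks


merges = train_bpe(["ATATATCG", "ATATGC"], num_merges=2)
print(merges)                        # [('A', 'T'), ('AT', 'AT')]
print(tokenize("ATATATCG", merges))  # ['ATAT', 'AT', 'C', 'G']
```

The resulting tokens are variable-length and non-overlapping, so the eight-base input becomes four tokens, which is the source of the input-length reduction the abstract refers to.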

Paper Structure

This paper contains 37 sections, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Illustration of the drawbacks of k-mer tokenization. In the overlapping setting, information about a masked token is leaked by its adjacent tokens, while in the non-overlapping setting, adding/deleting one nucleotide base leads to a dramatic change in the tokenized sequence.
  • Figure 2: Illustration of BPE vocabulary construction.
  • Figure 3: Average token length, average reduction in sequence length after tokenization, and model performance on the GUE benchmark for different vocabulary sizes.
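
The two k-mer drawbacks described in the Figure 1 caption can be reproduced on a toy sequence. The sketch below is illustrative only: the helper names (`kmer_overlapping`, `kmer_non_overlapping`, `recover_masked`) and the example sequence are assumptions, not taken from the paper.

```python
def kmer_overlapping(seq, k):
    """Stride-1 k-mers: adjacent tokens share k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_non_overlapping(seq, k):
    """Stride-k k-mers; the last token may be shorter than k."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def recover_masked(left, right):
    """Rebuild a masked overlapping k-mer from its two neighbours (k >= 2).

    The left neighbour's last k-1 bases are the masked token's first k-1
    bases, and the right neighbour's second-to-last base is its last base.
    """
    return left[1:] + right[-2]


seq = "ATGCGTACCTGA"

# Overlapping setting: the masked token is fully determined by its neighbours.
tokens = kmer_overlapping(seq, 3)
masked_index = 4
print(tokens[masked_index])                                          # GTA
print(recover_masked(tokens[masked_index - 1], tokens[masked_index + 1]))  # GTA

# Non-overlapping setting: deleting one base changes every token.
print(kmer_non_overlapping(seq, 3))      # ['ATG', 'CGT', 'ACC', 'TGA']
print(kmer_non_overlapping(seq[1:], 3))  # ['TGC', 'GTA', 'CCT', 'GA']
```

In the first case a masked-language-modeling objective can be solved by copying from neighbours, and in the second a single-base deletion leaves no token unchanged at any position.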