dnaGrinder: a lightweight and high-capacity genomic foundation model

Qihang Zhao, Chi Zhang, Weixiong Zhang

TL;DR

dnaGrinder introduces an encoder-only genomic foundation model that efficiently handles long-range dependencies in DNA by combining memory-efficient BPE tokenization, sequence length warmup, ALiBi attention, and Flash Attention 2. It leverages multispecies pretraining plus human 1000 Genomes variants and employs data augmentation and parental variant locus replacement to expand genome diversity without excessive compute. Across extensive benchmarks, dnaGrinder achieves state-of-the-art or competitive accuracy and MCC on a broad set of genomic tasks while reducing parameters and compute relative to prior models, and it supports extremely long input sequences on workstation GPUs. These advances offer practical opportunities for large-scale genomic analysis in research and clinical contexts, enabling long-context sequence modeling with accessible hardware requirements.
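To make the ALiBi component concrete, the following is a minimal sketch of how linear attention biases can be computed. It is not dnaGrinder's implementation. The slope formula is the standard one from the ALiBi paper for a power-of-two number of heads. The symmetric distance `|i - j|` is an assumption chosen here because the model is encoder-only (bidirectional). The function names `alibi_slopes` and `alibi_bias` are hypothetical.

```python
from __future__ import annotations


def alibi_slopes(num_heads: int) -> list[float]:
    """Geometric per-head slopes from the ALiBi paper (power-of-two head counts only)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles a power-of-two number of heads")
    # Head h (1-indexed) gets slope 2^(-8h / num_heads).
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(num_heads: int, seq_len: int) -> list[list[list[float]]]:
    """Return bias[h][i][j] = -slope_h * |i - j|.

    Assumption: a symmetric penalty, as would suit a bidirectional encoder.
    The bias is added to the attention logits before the softmax, so more
    distant token pairs are penalized linearly and no learned positional
    embedding is needed.
    """
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(num_heads)
    ]


if __name__ == "__main__":
    bias = alibi_bias(num_heads=8, seq_len=4)
    # First head, penalty between positions 0 and 3.
    print(bias[0][0][3])
```

Because the bias depends only on the distance between positions, it can be computed for any sequence length at inference time, which is one reason ALiBi is commonly paired with training on shorter sequences and evaluating on longer ones.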

Abstract

The task of understanding and interpreting the complex information encoded within genomic sequences remains a grand challenge in biological research and clinical applications. In this context, recent advancements in large language model research have led to the development of both encoder-only and decoder-only foundation models designed to decode intricate information in DNA sequences. However, several issues persist, particularly regarding the efficient management of long-range dependencies inherent in genomic sequences, the effective representation of nucleotide variations, and the considerable computational costs associated with large model architectures and extensive pretraining datasets. Current genomic foundation models often face a critical tradeoff: smaller models deliver mediocre performance, whereas larger models perform better only at considerably higher computational cost. To address these challenges, we introduce dnaGrinder, a unique and efficient genomic foundation model. dnaGrinder excels at managing long-range dependencies within genomic sequences while minimizing computational costs without compromising performance. It achieves results that are not just comparable but often superior to leading DNA models such as Nucleotide Transformer and DNABERT-2. Furthermore, dnaGrinder is designed for easy fine-tuning on workstation-grade GPUs, accommodating input lengths exceeding 17,000 tokens. On a single high-performance GPU, it supports sequences longer than 140,000 tokens, making it a highly efficient and accessible tool for both basic biological research and clinical applications.

Paper Structure

This paper contains 33 sections, 3 equations, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Sketches of (a) the architecture and (b) characteristics of dnaGrinder.
  • Figure 2: Variants and family trios.