M5: A Whole Genome Bacterial Encoder at Single Nucleotide Resolution

Agust Egilsson

TL;DR

M5 tackles the challenge of modeling bacterial genomes at single-nucleotide resolution with ultra-long context by introducing a linear attention Transformer encoder that scales to multi-million nucleotide inputs. The M5-small variant demonstrates that a low key-query dimension with many heads, coupled with a polynomial-based attention linearization and learned neighborhood position embeddings, enables training on segments up to $196{,}608$ nucleotides and inference on segments up to $2{,}000{,}000$ nucleotides on a single GPU. Empirical results show performance gains when training on longer sequences, in both cross-entropy and SNP-level accuracy, validating the linear approximation to softmax in long-context genomic modeling. The work lays groundwork for scalable, genome-scale language modeling with potential applications in genome annotation and comparative genomics, while outlining avenues for further optimization and multi-device scaling.

Abstract

This report describes a linear attention mechanism that extends the context length of an encoder-only transformer, called M5, to several million nucleotides, yielding a single-nucleotide-resolution foundation model pretrained on bacterial whole genomes. The mechanism closely approximates full quadratic attention and has a simple, lightweight implementation when the key-query embedding dimensionality is low. The M5-small model is trained and tested entirely on one A100 GPU with 40 GB of memory, on sequences of up to 196K nucleotides during training and 2M nucleotides during testing. We record notable improvements in the performance of M5-small as whole-genome bacterial sequence lengths increase, and we demonstrate that the approximation to full multi-head attention remains stable as sequence length grows.
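To make the idea of a polynomial-based attention linearization concrete, the following is a minimal sketch, not the M5 implementation. It assumes the softmax numerator exp(q·k) is replaced by the degree-2 Taylor polynomial 1 + x + x²/2; the polynomial M5 actually uses may differ. The function names `poly_features`, `quadratic_attention`, and `linear_attention` are hypothetical and chosen for illustration only.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def poly_features(x):
    """Feature map phi with phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2.

    Assumption: a degree-2 Taylor polynomial of exp stands in for the
    polynomial used by M5, which is not specified here.
    """
    d = len(x)
    second = [x[i] * x[j] / 2 ** 0.5 for i in range(d) for j in range(d)]
    return [1.0] + list(x) + second


def quadratic_attention(Q, K, V):
    """Reference version: forms all N x N kernel weights explicitly."""
    dv = len(V[0])
    out = []
    for q in Q:
        scores = [dot(q, k) for k in K]
        w = [1.0 + s + 0.5 * s * s for s in scores]  # always positive
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) / z
                    for c in range(dv)])
    return out


def linear_attention(Q, K, V):
    """Same output, with cost linear in the sequence length N."""
    fK = [poly_features(k) for k in K]
    m, dv = len(fK[0]), len(V[0])
    # S[f][c] = sum_n phi(k_n)[f] * v_n[c];  z[f] = sum_n phi(k_n)[f]
    S = [[0.0] * dv for _ in range(m)]
    z = [0.0] * m
    for fk, v in zip(fK, V):
        for f in range(m):
            z[f] += fk[f]
            for c in range(dv):
                S[f][c] += fk[f] * v[c]
    out = []
    for q in Q:
        fq = poly_features(q)
        denom = dot(fq, z)  # normalizer, equal to the sum of kernel weights
        out.append([sum(fq[f] * S[f][c] for f in range(m)) / denom
                    for c in range(dv)])
    return out
```

Both functions compute the same attention output; the linear version summarizes the keys and values once and reuses that summary for every query, so it never builds the N × N weight matrix. In this sketch a key-query dimension of 4 gives a feature vector of 1 + 4 + 16 = 21 entries, which illustrates why a low key-query dimension keeps such a linearization lightweight.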

Paper Structure (15 sections, 22 equations, 5 figures)

Introduction
Background
Methods
Linear attention used by M5
Example
Compute required by the M5 attention mechanism
Position embeddings
Network training
M5-small training data
Network architecture
Results
Small M5 model
Improved predictions as context length increases
Validity of the linear approximation to the softmax
Discussion and Future Work

Figures (5)

Figure 1: Example approximation on the interval [-1,2]
Figure 2: Performance (cross-entropy, CE) as context length is varied
Figure 3: Single nucleotide prediction model accuracy (SNPAcc)
Figure 4: Distribution of $m-\delta$ for $d_k=4$
Figure 5: Distribution of $m-\delta$ for $N=1{,}024$ vs $N=2{,}000{,}000$
