M5: A Whole Genome Bacterial Encoder at Single Nucleotide Resolution
Agust Egilsson
TL;DR
M5 tackles the challenge of modeling bacterial genomes at single-nucleotide resolution with ultra-long context by introducing a linear attention Transformer encoder that scales to multi-million nucleotide inputs. The M5-small variant demonstrates that a low key-query dimension with many heads, coupled with a polynomial-based attention linearization and learned neighborhood position embeddings, enables training on segments up to $196{,}608$ nucleotides and inference on segments up to $2{,}000{,}000$ nucleotides on a single GPU. Empirical results show performance gains when training on longer sequences, in both cross-entropy and SNP-level accuracy, validating the linear approximation to softmax in long-context genomic modeling. The work lays groundwork for scalable, genome-scale language modeling with potential applications in genome annotation and comparative genomics, while outlining avenues for further optimization and multi-device scaling.
Abstract
A linear attention mechanism is described that extends the context length of an encoder-only transformer, called M5 in this report, yielding a foundation model at single-nucleotide resolution with multi-million-nucleotide context, pretrained on bacterial whole genomes. The linear attention mechanism closely approximates full quadratic attention and has a simple, lightweight implementation when the key-query embedding dimensionality is low. The M5-small model is trained and tested entirely on one A100 GPU with 40 GB of memory, on sequences of up to 196K nucleotides during training and 2M nucleotides during testing. We evaluate the M5-small model and record notable improvements in performance as whole-genome bacterial sequence lengths increase; we also demonstrate that the approximation of full multi-head attention remains stable as sequence length increases.
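To illustrate why a low key-query dimension keeps a polynomial linearization lightweight, the following is a minimal sketch, not the implementation used in M5. It assumes a second-order Taylor polynomial of the exponential, $1 + x + x^2/2$, as the stand-in for softmax weights; the actual polynomial, scaling, and normalization used by M5 are not specified in this section. The function names (`poly_features`, `linear_attention`, `quadratic_reference`) are hypothetical. With key-query dimension $d$, the feature map has $1 + d + d^2$ entries, so the cost grows linearly in sequence length and stays small only when $d$ is low.

```python
import numpy as np

def poly_features(x):
    """Map (n, d) -> (n, 1 + d + d*d) so that
    poly_features(q) @ poly_features(k).T == 1 + s + s**2 / 2, with s = q @ k.T.
    Assumption: second-order Taylor polynomial of exp; M5 may use a different one."""
    n, d = x.shape
    ones = np.ones((n, 1))
    # Flattened outer product x_i * x_j gives the (q.k)^2 term; 1/sqrt(2) on each
    # side yields the factor 1/2 in the inner product.
    second = np.einsum("ni,nj->nij", x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([ones, x, second], axis=1)

def linear_attention(q, k, v):
    """Bidirectional (encoder-style) attention without forming the n x n matrix.
    q, k: (n, d), assumed already scaled by the caller; v: (n, dv)."""
    fq, fk = poly_features(q), poly_features(k)
    kv = fk.T @ v            # (f, dv): summary of keys and values, independent of n
    z = fk.sum(axis=0)       # (f,): normalizer summary
    # The denominator is a sum of terms 1 + s + s^2/2 >= 1/2, so it is positive.
    return (fq @ kv) / (fq @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same polynomial weights computed the O(n^2) way, for checking only."""
    s = q @ k.T
    w = 1.0 + s + 0.5 * s**2
    return (w @ v) / w.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(1024, 4)) * 0.3   # low key-query dimension d = 4
    k = rng.normal(size=(1024, 4)) * 0.3
    v = rng.normal(size=(1024, 16))
    diff = np.abs(linear_attention(q, k, v) - quadratic_reference(q, k, v)).max()
    print("max abs difference vs. quadratic form:", diff)
```

The linear form and the quadratic reference compute the same quantity up to floating-point error; how closely either tracks true softmax attention depends on the magnitude of the query-key scores, which this sketch does not control.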
