METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger

TL;DR

METAGENE-1 introduces a 7B decoder-only transformer pretrained on a novel metagenomic wastewater corpus totaling over 1.5 trillion base pairs to enable pandemic monitoring and pathogen detection. Using a BPE tokenizer trained on short 100–300 base-pair reads, the model achieves state-of-the-art performance on pathogen-detection and genomic-embedding benchmarks, demonstrating strong generalization across diverse sequencing conditions. The study emphasizes safety, open science, and continual pretraining to broaden applicability, while noting limitations tied to short-read data and distributional scope. The work highlights the potential of metagenomic foundation models for biosurveillance, anomaly detection, and early threat monitoring in public health contexts.

Abstract

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
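The abstract mentions byte-pair encoding over nucleotide sequences without showing what that involves. As a rough illustration only, the following toy sketch learns BPE merges from a handful of reads and applies them to a new read. The function names (`merge_pair`, `train_bpe`, `encode`), the tie-breaking rule, and the tiny corpus are assumptions made for this example; the excerpt does not describe the authors' actual tokenizer implementation, vocabulary size, or training settings.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(reads, num_merges):
    """Learn up to `num_merges` merge rules from nucleotide reads (toy sketch)."""
    # Start from single-character tokens (A, C, G, T, ...).
    corpus = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all reads.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(read, merges):
    """Tokenize a read by applying the learned merges in training order."""
    tokens = list(read)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


if __name__ == "__main__":
    merges = train_bpe(["ACGTACGT", "ACGACG"], num_merges=2)
    print(merges)
    print(encode("ACGT", merges))
```

Because merges only concatenate adjacent tokens, joining the encoded tokens always reproduces the original read, so the encoding is lossless while shortening the token sequence for frequently occurring subsequences.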

Paper Structure

This paper contains 25 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of METAGENE-1 and applications. Wastewater samples are collected and undergo deep metagenomic sequencing to generate DNA and RNA sequences totaling over 1.5 trillion base pairs. These sequences are tokenized using byte-pair encoding (BPE) to create the pretraining dataset. The data is used to train METAGENE-1, a 7B-parameter transformer model that enables a wide range of metagenomic analysis and monitoring applications.
  • Figure 2: Overview of the metagenomic data collection and sequencing pipeline for model pretraining. The process begins with the collection of wastewater (left), which contains genomic fragments from a diverse collection (e.g., tens of thousands) of constituent organisms (center). These samples are processed via high-throughput metagenomic sequencing to produce millions of paired-end reads (right), each consisting of hundreds of base pairs. The complete dataset comprises over 1.5 trillion base pairs of metagenomic sequences used for model pretraining.
  • Figure 3: Metagenomic composition of the METAGENE-1 pretraining dataset, estimated via Kraken 2 (Wood et al., 2019) sequence classification and visualized via Krona (Ondov et al., 2011). See the data-snapshots figure (fig:data-snapshots-figure) for a more detailed view.
  • Figure 4: The $z$-loss over the course of pretraining; this term both aids training stability and serves as an indicator of it.
  • Figure 5: METAGENE-1 loss curves during pretraining: training loss (left) and validation loss on a held-out metagenomic sample (right).
  • ...and 2 more figures
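
The Figure 4 caption refers to a $z$-loss without defining it. One widely used formulation penalizes the squared log of the softmax normalizer, coeff * (log Z)^2 with Z = sum_i exp(logit_i), which discourages the logits from drifting to large magnitudes. The sketch below assumes that formulation; the coefficient value and the function names (`log_partition`, `z_loss`) are illustrative assumptions and are not taken from this excerpt.

```python
import math


def log_partition(logits):
    """Numerically stable log(sum(exp(logits))), i.e. log Z of the softmax."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2 for one position's logits.

    `coeff` is an assumed illustrative value, not one reported in this excerpt.
    """
    return coeff * log_partition(logits) ** 2


if __name__ == "__main__":
    # Logits with a normalizer near 1 incur almost no penalty;
    # uniformly shifted logits are penalized more as log Z grows.
    print(z_loss([math.log(0.5), math.log(0.5)]))
    print(z_loss([10.0, 10.0]))
```

Under this assumed definition, the penalty is zero exactly when the softmax normalizer equals 1, so tracking its value over training gives a simple signal of whether the output logits are staying well scaled.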