Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity

Zhufeng Li, Sandeep S Cranganore, Nicholas Youngblut, Niki Kilbertus

TL;DR

This work tackles predicting microbiome habitat specificity from whole-genome sequences, a challenging genotype-phenotype problem driven by complex gene interactions. It introduces a genome-scale transformer that operates on fixed-size gene embeddings derived from a large protein language model (ESM-2), representing each genome as a sequence of gene tokens and learning habitat-specific patterns. The method achieves strong habitat classification performance on ProGenomes v3 and provides attribution-based gene interaction networks that recover known interactions and propose new candidates for experimental follow-up. By leveraging sequence-level information and gene co-presence patterns, the approach offers interpretable insights into how microbial genes collectively shape environmental adaptation, with potential implications for environmental, agricultural, and medical applications.
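The representation described above, a genome as a sequence of fixed-size gene tokens, can be illustrated with a minimal sketch. The embedding function here is a toy stand-in (amino-acid composition), not ESM-2, and the names `toy_embed` and `genome_to_tokens` are hypothetical; the only point is that proteins of any length map to vectors of the same size, so a genome with $n_j$ genes becomes an $n_j \times d_{\mathrm{emb}}$ array.

```python
from typing import List

# Toy stand-in for a protein language model such as ESM-2 (assumption):
# it maps a protein of any length to a vector of fixed size D_EMB.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
D_EMB = len(ALPHABET)


def toy_embed(protein: str) -> List[float]:
    """Amino-acid composition vector; NOT a real ESM-2 embedding."""
    counts = [protein.count(aa) for aa in ALPHABET]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def genome_to_tokens(contigs: List[List[str]]) -> List[List[float]]:
    """Flatten a genome (contigs -> protein sequences) into n_j gene tokens."""
    return [toy_embed(protein) for contig in contigs for protein in contig]


# A made-up genome with two contigs and three genes in total.
genome = [["MKV", "MAAAL"], ["MGG"]]
tokens = genome_to_tokens(genome)
print(len(tokens), len(tokens[0]))  # n_j and d_emb
```

In a real pipeline the toy function would be replaced by a call to the protein language model; how per-residue outputs are pooled into one vector per gene is not specified in this summary.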

Abstract

Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework that takes advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high-quality microbiome genomes from different habitats. We demonstrate not only solid predictive performance, but also how sequence-level information from entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow-up.

Paper Structure (29 sections, 9 figures, 8 tables)

Figures (9)

  • Figure 1: A conceptual overview of our data preprocessing pipeline. Each sample stands for an entire genome, reconstructed from shotgun sequencing in terms of contiguous consensus regions (contigs). We identify all genes within each contig (using Prodigal) and embed the corresponding protein sequences using an existing protein large language model (ESM-2) into a $d_{\mathrm{emb}}$-dimensional vector space. A single 'input example', corresponding to an entire genome, is ultimately represented by a $(n_j \times d_{\mathrm{emb}})$-dimensional tensor, where $n_j$ is the number of genes in genome $j$.
  • Figure 2: Left: Histogram of the number of contigs per sample (genome). Center: Histogram of the number of genes per sample (genome). Right: Histogram of the number of genes per contig.
  • Figure 3: A conceptual overview of our training and attribution pipelines. Training: We feed the $(n_j \times d_{\mathrm{emb}})$-dimensional inputs to our transformer, interpreted as a sequence of $n_j$ 'tokens', each already represented by a fixed $d_{\mathrm{emb}}$-dimensional embedding. We randomly shuffle the contigs within each sample, since the 'correct' order is unknown. Our model is then trained with the cross-entropy loss for classification. Attribution: After training, we extract the last-layer attention maps for all validation samples. We find the indices of the top-$k$ attention scores in each map, i.e., which gene embedding attends strongly to which other gene embedding. We cluster these pairs and visualize the clustering via non-linear dimensionality reduction. Within each cluster, we then re-identify the nucleotide sequences of all genes within all pairs and match them against gene annotation databases.
  • Figure 4: Two-dimensional visualization of the clusters for aquatic (left), host (middle), and soil (right) samples via UMAP (McInnes et al., 2018), omitting points not belonging to any cluster.
  • Figure 5: Gene interaction network for the sample 1311.SAMN14644158. Coral/violet: genes with more than one neighbor (hubs); blue: genes with one neighbor (peripheral). Purple hubs are described in the text. Numbers indicate the order of appearance on the genome.
  • ...and 4 more figures
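
Two steps from the Figure 3 caption, shuffling contigs within a sample and picking the top-$k$ entries of an attention map, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the attention map is a plain nested list, and all (query, key) pairs including the diagonal are kept, which the caption does not specify.

```python
import heapq
import random
from typing import List, Sequence, Tuple


def shuffle_contigs(contigs: Sequence[Sequence[str]], rng: random.Random) -> List[str]:
    """Permute the contig order while keeping the gene order inside each contig."""
    order = list(range(len(contigs)))
    rng.shuffle(order)
    return [gene for i in order for gene in contigs[i]]


def top_k_pairs(attn: Sequence[Sequence[float]], k: int) -> List[Tuple[int, int]]:
    """Return the k (query, key) index pairs with the largest attention scores."""
    scored = (
        (score, q, key)
        for q, row in enumerate(attn)
        for key, score in enumerate(row)
    )
    best = heapq.nlargest(k, scored)  # sorted from largest to smallest score
    return [(q, key) for _, q, key in best]


# Made-up sample: three contigs with gene identifiers as placeholders.
contigs = [["g1", "g2"], ["g3"], ["g4", "g5"]]
shuffled = shuffle_contigs(contigs, random.Random(0))

# Made-up 3 x 3 attention map (rows: queries, columns: keys).
attn = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.05, 0.05, 0.90],
]
pairs = top_k_pairs(attn, k=2)
print(pairs)  # [(2, 2), (0, 1)]
```

Shuffling at the contig level rather than the gene level preserves local gene neighborhoods, which are known, while discarding the unknown order of contigs. The resulting index pairs would then be mapped back to gene embeddings for clustering, as the caption describes.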