Recent advances in deep learning and language models for studying the microbiome

Binghao Yan; Yunbi Nam; Lingyao Li; Rebecca A. Deek; Hongzhe Li; Siyuan Ma

TL;DR

The paper surveys how deep learning and language modeling, including large language models, are transforming microbiome and metagenomics research. It covers protein-language models for novel protein generation and function/structure prediction, DNA/genomic language models for contig/genome-scale context, and specialized applications in virome annotation, virus–host interactions, and biosynthetic gene cluster discovery. It highlights foundational datasets and model architectures, such as DNABERT, ESMFold, ProGen, and DeepBGC, and discusses the role of public knowledge integration in mining microbe–disease associations. The discussion identifies key future directions, including multi-omics data fusion, standardized benchmarks, and new architectures (e.g., graph-based, hierarchical) to better capture the complex dependencies of microbial communities.

Abstract

Recent advancements in deep learning, particularly large language models (LLMs), have had a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein and genomic language modeling and its contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.

Paper Structure (15 sections, 2 figures, 2 tables)

This paper contains 15 sections, 2 figures, 2 tables.

Introduction
Brief review of LLMs and their extension towards modeling the language of life
Language modeling of proteins, contigs, and genomes of the microbiome
Protein language models for novel protein generation
Protein language models for function and structure prediction
DNA language models at the genomic scale
Genomic language models contextualize genes and gene clusters
Language models for virome annotation and virome-host interactions
Virome sequence annotation and identification
Deep learning and LLM methods for virome-host interaction
Deep learning and language models for prediction of biosynthetic gene clusters
Deep learning methods for BGC prediction
BGC prediction based on language models
Public knowledge integration in microbiome studies with LLMs
Discussion

Figures (2)

Figure 1: Review of protein/DNA/genomic language models as applied to metagenomic studies. A. Protein and genomic sequences share properties with natural language sequences, with amino acids or nucleotides as the units of a sequence ("tokens"). The complex dependency structure of protein/gene-level or genomic-scale sequences can then be modeled by language model techniques, such as the transformer-based attention mechanism, for various downstream tasks. B. Review of encoder- and decoder-style transformer attention mechanisms and their applications in metagenomic studies. Encoder-style model architecture (similar to that of BERT) aims to provide a meaningful representation of genomic sequences and is useful for downstream predictive tasks. Decoder-style model architecture (similar to that of ChatGPT) generates new sequences given past tokens and is most useful for generative tasks such as novel protein design.
Figure 2: A comparison of BGC prediction based on pfam2vec embeddings for Pfam-level prediction and on PLM embeddings for amino-acid-level prediction.
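
The token and attention ideas in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `kmer_tokens` and `attention_mask` are hypothetical, and overlapping k-mers are assumed here as one possible tokenization for nucleotide sequences (setting `k=1` gives the single-nucleotide tokens the caption describes). The masks show only which positions may attend to which; a real model would apply them inside learned attention layers.

```python
from typing import List


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers with stride 1.

    Assumption for illustration: k-mer tokens. k=1 yields one token
    per nucleotide.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def attention_mask(n: int, causal: bool) -> List[List[int]]:
    """Build an n x n mask where mask[i][j] == 1 means token i may attend to token j.

    causal=False: every token sees every token (encoder-style, BERT-like).
    causal=True:  token i sees only tokens 0..i (decoder-style, GPT-like).
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n)]
        for i in range(n)
    ]


tokens = kmer_tokens("ATGCGT", k=3)
encoder_mask = attention_mask(len(tokens), causal=False)
decoder_mask = attention_mask(len(tokens), causal=True)

print(tokens)
for row in decoder_mask:
    print(row)
```

The full mask suits representation learning for predictive tasks, because each token's embedding can draw on context from both sides. The lower-triangular mask suits generation, because each new token is predicted from past tokens only.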