Recent advances in deep learning and language models for studying the microbiome
Binghao Yan, Yunbi Nam, Lingyao Li, Rebecca A. Deek, Hongzhe Li, Siyuan Ma
TL;DR
The paper surveys how deep learning and language modeling, including large language models, are transforming microbiome and metagenomics research. It covers protein language models for novel protein generation and function/structure prediction, DNA/genomic language models for contig- and genome-scale context, and specialized applications in virome annotation, virus–host interactions, and biosynthetic gene cluster discovery. It highlights foundational datasets and representative models, such as DNABERT, ESMFold, ProGen, and DeepBGC, and discusses the role of public knowledge integration in mining microbe–disease associations. The discussion identifies key future directions, including multi-omics data fusion, standardized benchmarks, and new architectures (e.g., graph-based, hierarchical) to better capture the complex dependencies of microbial communities.
Abstract
Recent advancements in deep learning, particularly large language models (LLMs), have had a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein and genomic language modeling and its contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.
