FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics

ChenRui Duan, Zelin Zang, Yongjie Xu, Hang He, Zihan Liu, Siyuan Li, Zijia Song, Ju-Sheng Zheng, Stan Z. Li

TL;DR

FGBERT tackles metagenomic analysis by introducing a context-aware, protein-based gene tokenizer and two pre-training objectives. MGM models inter-gene context to resolve One-to-Many relationships, while TMC uses contrastive learning to align gene sequences with functions under Many-to-One mappings. Pre-trained on over 100 million sequences, FGBERT achieves state-of-the-art performance across gene, functional, bacterial, and environmental tasks, with case studies on ATP synthase and operons illustrating functional recognition and biological relevance. The approach offers scalable, function-driven representations for metagenomic interpretation and downstream functional annotation, with potential for integration of multi-omics data in the future.

Abstract

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the One-to-Many and Many-to-One relationships inherent in metagenomic data. To overcome these challenges, we introduce FGBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the understanding of inter-gene contextual relationships and Triplet Enhanced Metagenomic Contrastive Learning (TMC) to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGBERT demonstrates superior performance on metagenomic datasets at four levels (gene, functional, bacterial, and environmental), with datasets ranging from 1k to 213k input sequences. Case studies of ATP Synthase and Gene Operons highlight FGBERT's capability for functional recognition and its biological relevance in metagenomic research.
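
The two pre-training objectives named in the abstract can be illustrated with a minimal sketch. It assumes the 15% masking ratio given in the Figure 2 caption and uses the standard triplet margin loss with Euclidean distance; the paper's exact loss, distance, and masking procedure are not stated in this summary, so the function names, the `[MASK]` symbol, the `margin` value, and the toy embeddings below are all hypothetical.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_gene_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace a fraction of gene tokens with MASK (MGM-style input).

    Returns the masked sequence and a dict {position: original token}
    holding the targets a model would be trained to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy usage: 20 placeholder gene tokens, so 15% masks 3 of them.
genes = [f"gene_{i}" for i in range(20)]
masked, targets = mask_gene_tokens(genes)

# The positive is closer to the anchor than the negative by more than
# the margin, so this triplet contributes no loss.
loss = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
```

In this sketch the loss is zero once the negative sample is farther from the anchor than the positive by at least the margin, which is the sense in which a triplet objective separates gene embeddings that differ in function.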

Paper Structure (26 sections, 8 equations, 7 figures, 20 tables)

This paper contains 26 sections, 8 equations, 7 figures, 20 tables.

Figures (7)

  • Figure 1: Motivation. Two types of complex relationships between gene sequences and functions in metagenomics. The One-to-Many problem means that the same gene may display different functions depending on the genomic context; for example, ATP synthase works differently in plants, heterotrophic bacteria, and humans. The Many-to-One problem means that multiple genes may perform the same function; for instance, different genes from different bacteria, e.g., Cpf1, Cas1, etc., produce the same resistance function within the CRISPR immune system.
  • Figure 2: Overview of FGBERT. A metagenomic sequence $\mathcal{X}$ is converted into ordered protein-based gene representations $\mathcal{G}$ via a Context-Aware Tokenizer. Next, we pre-train a Gene Encoder with $\mathcal{L}_{\text{MGM}}$, in which 15% of these tokens are masked and their labels $\mathcal{Y}$ are predicted. Meanwhile, we introduce $\mathcal{L}_{\text{Tri}}$ to distinguish gene sequences. The data augmentation and negative sampling modules generate positive samples $\mathcal{G}^+$ and negative samples $\mathcal{G}^-$, respectively. Finally, after fine-tuning, FGBERT can handle various downstream tasks.
  • Figure 2: Description of Experimental Datasets.
  • Figure 3: Ablation studies of Our Proposed Modules on Four Downstream Tasks.
  • Figure 4: Classification Results on CARD-R.
  • ...and 2 more figures