GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction

Yuwei Miao, Yuzhi Guo, Hehuan Ma, Jingquan Yan, Feng Jiang, Rui Liao, Junzhou Huang

TL;DR

GoBERT is a Gene Ontology (GO) graph-informed BERT model for universal gene function prediction. It combines explicit information, namely the GO DAG structure and the semantic descriptions of GO terms, with implicit relation modeling through masked language modeling (MLM). It employs two pre-training tasks: a self-supervised neighborhood prediction task over the GO DAG that captures explicit relations, and an MLM objective without positional encoding that uncovers implicit function relations. The two are optimized jointly as $\mathcal{L}^{\text{Total}}=\lambda\mathcal{L}^{\text{Ex}}+(1-\lambda)\mathcal{L}^{\text{Im}}$. The approach enables large-scale novel function prediction, achieving notable top-5 accuracy (e.g., $76.15\%$ at targeted depth), and is supported by biologically meaningful case studies and by ablations that validate the contributions of explicit semantics, graph structure, and masking strategies. This work supports scalable, cross-species gene function annotation from known functions alone, and suggests future enhancements that add further data modalities and incorporate non-annotated functions for comprehensive GO coverage.
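The joint objective above can be illustrated with a small numeric sketch. This is a minimal pure-Python example, not the authors' implementation: it assumes binary cross-entropy for the multi-label neighborhood-prediction loss $\mathcal{L}^{\text{Ex}}$ and categorical cross-entropy for the masked-function recovery loss $\mathcal{L}^{\text{Im}}$ (the summary does not state the exact loss forms). All function names and toy probabilities are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def bce_multilabel(probs, labels):
    """Mean binary cross-entropy over a multi-hot label vector (assumed form of L^Ex)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def cross_entropy(probs, target_index):
    """Cross-entropy for one masked position (assumed form of L^Im)."""
    return -math.log(max(probs[target_index], EPS))


def total_loss(loss_ex, loss_im, lam):
    """L^Total = lam * L^Ex + (1 - lam) * L^Im, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * loss_ex + (1.0 - lam) * loss_im


# Toy values: predicted neighbor probabilities vs. adjacency-derived labels,
# and a predicted distribution over three candidate functions for one mask.
loss_ex = bce_multilabel([0.9, 0.2, 0.7], [1, 0, 1])
loss_im = cross_entropy([0.1, 0.6, 0.3], target_index=1)
loss = total_loss(loss_ex, loss_im, lam=0.5)
```

Setting `lam` to 1 trains on the explicit task only and setting it to 0 trains on the implicit task only, so $\lambda$ acts as a single knob that trades off the two pre-training signals.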

Abstract

Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet-lab experiments. Existing methods for automatic function annotation or prediction mainly focus on protein function prediction using sequence, 3D structure, or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring the Gene Ontology graph and its annotations with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task takes existing functions as inputs and generalizes function prediction to genes and gene products. Specifically, two pre-training tasks are designed to jointly train GoBERT to capture both explicit and implicit relations among functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. A specified masking-and-recovering task helps GoBERT find implicit patterns among functions. The pre-trained GoBERT is able to predict novel functions for various genes and gene products based on their known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of the proposed GoBERT.
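The masking-and-recovering idea can be sketched as follows. This is a hedged illustration rather than the paper's actual masking strategy: it assumes each gene is represented as an unordered set of function tokens padded to a fixed length, and that a fixed fraction of them is hidden uniformly at random. The `mask_ratio` default, the `[PAD]`/`[MASK]` token strings, the helper name, and the toy function identifiers are all hypothetical.

```python
import random

PAD, MASK = "[PAD]", "[MASK]"  # hypothetical special tokens


def mask_functions(functions, max_len, mask_ratio=0.15, rng=None):
    """Hide a fraction of a gene's known functions and pad to max_len.

    Returns (inputs, targets), where targets maps each masked position
    to the original function the model should recover.
    """
    rng = rng or random.Random()
    functions = list(functions)[:max_len]
    # Mask at least one function whenever the gene has any annotation.
    n_mask = max(1, round(mask_ratio * len(functions))) if functions else 0
    masked_positions = set(rng.sample(range(len(functions)), n_mask))
    inputs = [MASK if i in masked_positions else f
              for i, f in enumerate(functions)]
    targets = {i: functions[i] for i in sorted(masked_positions)}
    inputs += [PAD] * (max_len - len(inputs))
    return inputs, targets


# Toy gene with four placeholder function identifiers (not real GO IDs).
gene_functions = ["func_a", "func_b", "func_c", "func_d"]
inputs, targets = mask_functions(gene_functions, max_len=6,
                                 mask_ratio=0.5, rng=random.Random(0))
```

Because the inputs form a set of annotations rather than a sentence, the order of the tokens carries no meaning here, which is consistent with training the masked objective without positional encoding.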

Paper Structure (32 sections, 18 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Explicit relations between functions are depicted through GO DAG structures (edges) and semantic information (nodes). (a) The structure information can be represented by the adjacency matrix of the GO DAG, which serves as labels in neighborhood prediction. (b) Semantic information is captured by encoding raw text descriptions of each node in GO DAG with LLMs.
  • Figure 2: The main components and framework of GoBERT. White blocks represent pad tokens and slashed blocks represent mask tokens. Red, yellow, and blue blocks indicate functions belonging to the Biological Process, Molecular Function, and Cellular Component categories, respectively. $D_{\text{train}}$ contains the input data, with N genes and k functions for each gene. The designed masking strategy is then applied. LLM-generated embeddings are used to initialize the token embedding $\mathbf{E}$ in GoBERT. For the implicit pre-training task, $\mathcal{L}^{\text{Im}}$ is the loss between the predicted masked functions and the ground-truth functions in $D_{\text{train}}$. For the explicit pre-training task, labels $\mathbf{y}^{\text{Ex}}_i$ are obtained from the adjacency matrix $\mathbf{A}$ of the GO DAG, where $\mathcal{L}$ denotes the total number of nodes or functions. $\mathcal{L}^{\text{Ex}}$ is calculated to capture the structural information of functions.
  • Figure 3: Implicit relations among gene functions, demonstrated by three examples. (a) Pleiotropy: a single gene can control multiple phenotypes, indicating underlying relationships between these functions. (b) A protein that contributes to a transport biological process is potentially located on the membrane. (c) The same gene can produce different functional outcomes depending on the tissue in which it is expressed; for example, its gene products may have different functions in the liver and in the stomach.
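
As a concrete illustration of the caption of Figure 1(a), the sketch below builds an adjacency matrix for a toy DAG and reads one row as the multi-hot label vector for a node in neighborhood prediction. It is a minimal example under stated assumptions: the edge list is invented, and whether the paper treats neighbors as parents only, children only, or both is not stated in the captions, so the `symmetric` flag and the helper name are hypothetical.

```python
def adjacency_labels(num_nodes, edges, symmetric=True):
    """Build a 0/1 adjacency matrix from (child, parent) edges.

    Row i can serve as the multi-hot neighborhood label for node i.
    With symmetric=True, both parents and children count as neighbors
    (an assumption made for this sketch).
    """
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for child, parent in edges:
        A[child][parent] = 1
        if symmetric:
            A[parent][child] = 1
    return A


# Toy DAG: node 0 is the root; nodes 1 and 2 are its children;
# node 3 has two parents (1 and 2), which a DAG allows and a tree does not.
toy_edges = [(1, 0), (2, 0), (3, 1), (3, 2)]
A = adjacency_labels(4, toy_edges)
label_for_node_3 = A[3]  # neighbors of node 3 are nodes 1 and 2
```

Each row has one entry per node in the graph, so the label dimension equals the total number of functions, which is why the task is multi-label rather than single-class.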