UniEntrezDB: Large-scale Gene Ontology Annotation Dataset and Evaluation Benchmarks with Unified Entrez Gene Identifiers

Yuwei Miao, Yuzhi Guo, Hehuan Ma, Jingquan Yan, Feng Jiang, Weizhi An, Jean Gao, Junzhou Huang

TL;DR

UniEntrezDB addresses the dispersion of Gene Ontology Annotations across databases by unifying them under shared Entrez Gene IDs and delivering a GOA corpus plus four downstream benchmarks for gene, protein, and cell-level evaluation. The approach creates UniEntrezGOA by aggregating GOA from 21 databases and aligning entries to Entrez IDs across 1000+ species, and defines four evaluation tasks (Pathway Co-present, Functional Gene Interaction, PPI, and Single-Cell Type Annotation) to test embeddings. Baseline and composite embeddings that incorporate GOA information (GOA_Emb, GOA_Emb+Gene2Vec, GOA_Emb+DNABert) generally improve performance, while a purely protein-pretrained model (OntoProtein) struggles on gene-level tasks, highlighting the value of multi-source, unified ontological knowledge. The work furnishes a practical resource for integrating structured domain knowledge into gene-focused AI systems, supporting more reliable reasoning in protein design, drug discovery, and single-cell analysis.

Abstract

Gene studies are crucial for fields such as protein structure prediction, drug discovery, and cancer genomics, yet they face challenges in fully utilizing the vast and diverse information available. Gene studies require clean, factual datasets to ensure reliable results. Ontology graphs, which are neatly organized graphs of domain terminology, are ideal sources of domain facts. However, available gene ontology annotations are currently distributed across various databases without unified identifiers for genes and gene products. To address these challenges, we introduce Unified Entrez Gene Identifier Dataset and Benchmarks (UniEntrezDB), the first systematic effort to unify large-scale public Gene Ontology Annotations (GOA) from various databases using unique gene identifiers. UniEntrezDB includes a pre-training dataset and four downstream tasks designed to comprehensively evaluate gene embedding performance at the gene, protein, and cell levels, ultimately enhancing the reliability and applicability of LLMs in gene research and other professional settings.
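As a rough illustration of what unifying annotations under one identifier involves, the sketch below merges GO annotations from several sources into a single table keyed by Entrez Gene ID. It is not the authors' pipeline. The function name `unify_annotations`, the source names `SourceB` and `SourceC`, their object IDs, and the pairing of GO terms with genes are all made up for the example; only the link between Entrez Gene 3630 and UniProtKB P01308 comes from the paper's Figure 1.

```python
from collections import defaultdict

# Hypothetical cross-reference table: (source database, object ID) -> Entrez Gene ID.
# Only the UniProtKB P01308 -> 3630 link is taken from the paper (Figure 1).
TOY_ID_TO_ENTREZ = {
    ("UniProtKB", "P01308"): "3630",
    ("SourceB", "B-001"): "3630",  # made-up ID for the same gene in another source
}

# Toy annotation records: (source database, object ID, GO term ID).
# The GO IDs are borrowed from the Figure 2 caption purely as placeholders.
TOY_RECORDS = [
    ("UniProtKB", "P01308", "GO:0048856"),
    ("SourceB", "B-001", "GO:0007275"),
    ("SourceB", "B-001", "GO:0048856"),  # same fact reported by a second source
    ("SourceC", "C-404", "GO:0048856"),  # no known mapping to an Entrez ID
]


def unify_annotations(records, id_to_entrez):
    """Group GO terms by Entrez Gene ID and collect records that cannot be mapped."""
    unified = defaultdict(set)
    unmapped = []
    for db, object_id, go_id in records:
        entrez_id = id_to_entrez.get((db, object_id))
        if entrez_id is None:
            unmapped.append((db, object_id, go_id))
            continue
        # A set removes duplicates that arrive from different sources.
        unified[entrez_id].add(go_id)
    return dict(unified), unmapped


if __name__ == "__main__":
    unified, unmapped = unify_annotations(TOY_RECORDS, TOY_ID_TO_ENTREZ)
    for entrez_id, go_terms in unified.items():
        print(entrez_id, sorted(go_terms))
    print("unmapped records:", len(unmapped))
```

Keeping the unmapped records separate, rather than dropping them silently, makes it possible to report how many identifiers failed to map from each source.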

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of the transcription, translation, and GO annotation process. Gene 3630 [maglott2005entrez] is transcribed into mRNA, which is then translated into the human insulin protein [uniprot2019uniprot] (UniProtKB ID: P01308). GO terms annotate gene products with functions; for example, the human insulin protein enables insulin receptor binding.
  • Figure 2: (a) Illustration of the GO DAG. The zoomed-in view shows the edge relationships between GO terms: GO:000988 "is a" GO:0048856 ("embryonic pattern specification" is an "anatomical structure development"), and GO:0007275 is "part of" GO:0048856 ("multicellular organism development" is part of "anatomical structure development"). (b) Examples of GOA in GAF format.
  • Figure 3: Statistics of the UniEntrezGOA dataset. More details are available in the Appendix. (a) Phylogenetic tree of the more than 1000 species covered by the manually reviewed annotations in UniEntrezGOA. (b) ID mapping procedure between different databases. The number on each arrow is the number of IDs mapped successfully from the source database to the target database. (c) Distribution of manually reviewed GOA added each year for each gene and gene product. (d) According to the official GO website, there are six categories of evidence codes. Only the electronic annotation evidence code, IEA, marks annotations that are not manually reviewed.
  • Figure 4: Benchmark illustration. (a) Pathway Co-present Prediction: the relationship between pathways (orange) and genes (blue) in MSigDB [liberzon2011molecular]. An edge connects a pathway to each gene that belongs to it. (b) Functional Gene Interaction Prediction: the PathwayCommons [wong2021science, 10.1093/nar/gkz946] dataset contains 5 different interaction types between genes. (c) Protein-Protein Interaction: illustration of different protein interactions in STRING [szklarczyk2019string]. (d) Single-Cell Type Annotation: t-SNE visualization of zheng68k [zheng2017massively]. Each point is a cell embedding, and there are 11 cell types.
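
The captions for Figures 2(b) and 3(d) refer to GOA records in GAF format and to the split between manually reviewed annotations and electronic (IEA) ones. A minimal sketch of reading such records and separating the two groups might look like the following. It assumes the standard tab-separated GAF 2.x layout with `!` comment lines. The helper names (`parse_gaf_line`, `split_by_review`) are hypothetical, and the two sample rows are invented for illustration, not real annotations.

```python
# Column names for the 17 tab-separated fields of a GAF 2.x record.
GAF_COLUMNS = (
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
)

# Two invented rows (placeholder references, dates, and GO pairings).
TOY_GAF = "\n".join([
    "!gaf-version: 2.2",
    "\t".join(["UniProtKB", "P01308", "INS", "involved_in", "GO:0048856",
               "PMID:0000000", "IDA", "", "P", "Insulin", "", "protein",
               "taxon:9606", "20200101", "ExampleGroup", "", ""]),
    "\t".join(["UniProtKB", "P01308", "INS", "involved_in", "GO:0007275",
               "GO_REF:0000000", "IEA", "", "P", "Insulin", "", "protein",
               "taxon:9606", "20200101", "ExampleGroup", "", ""]),
])


def parse_gaf_line(line):
    """Return a dict for one annotation row, or None for comments and short rows."""
    if not line.strip() or line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 15:  # the last two GAF columns are optional
        return None
    return dict(zip(GAF_COLUMNS, fields))


def split_by_review(lines):
    """Separate annotations into manually reviewed and electronic (IEA) groups."""
    reviewed, electronic = [], []
    for line in lines:
        annotation = parse_gaf_line(line)
        if annotation is None:
            continue
        if annotation["evidence_code"] == "IEA":
            electronic.append(annotation)
        else:
            reviewed.append(annotation)
    return reviewed, electronic


if __name__ == "__main__":
    reviewed, electronic = split_by_review(TOY_GAF.splitlines())
    print("manually reviewed:", [a["go_id"] for a in reviewed])
    print("electronic (IEA):", [a["go_id"] for a in electronic])
```

Treating every code other than IEA as manually reviewed follows the Figure 3(d) caption; a stricter filter could list the accepted evidence codes explicitly.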