UniEntrezDB: Large-scale Gene Ontology Annotation Dataset and Evaluation Benchmarks with Unified Entrez Gene Identifiers

Yuwei Miao; Yuzhi Guo; Hehuan Ma; Jingquan Yan; Feng Jiang; Weizhi An; Jean Gao; Junzhou Huang

TL;DR

UniEntrezDB addresses the scattering of Gene Ontology Annotations (GOA) across databases by unifying them under shared Entrez Gene IDs. It delivers a GOA corpus plus four downstream benchmarks for evaluation at the gene, protein, and cell levels. The corpus, UniEntrezGOA, aggregates GOA from 21 databases and aligns the entries to Entrez IDs across more than 1,000 species. The four evaluation tasks (Pathway Co-present, Functional Gene Interaction, PPI, and Single-Cell Type Annotation) test gene embeddings. Baseline and composite embeddings that incorporate GOA information (GOA_Emb, GOA_Emb+Gene2Vec, GOA_Emb+DNABert) generally improve performance, while a purely protein-pretrained model (OntoProtein) struggles on gene-level tasks, which highlights the value of multi-source, unified ontological knowledge. The work provides a practical resource for integrating structured domain knowledge into gene-focused AI systems, in support of more reliable reasoning in protein design, drug discovery, and single-cell analysis.
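The alignment step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the cross-reference table `ID_MAP`, the record layout `(source_db, native_id, go_term)`, and the function name `unify_annotations` are all assumptions made for illustration, and the records are toy data. The idea is only that annotations keyed by database-specific identifiers are merged under one Entrez Gene ID, and that entries without a known mapping are set aside rather than guessed.

```python
from collections import defaultdict

# Hypothetical cross-reference table (assumption, not from the paper):
# (source database, native identifier) -> Entrez Gene ID.
ID_MAP = {
    ("UniProtKB", "P04637"): "7157",          # TP53
    ("Ensembl", "ENSG00000141510"): "7157",   # TP53
    ("UniProtKB", "P38398"): "672",           # BRCA1
}

# Toy annotation records: (source database, native identifier, GO term).
RAW_ANNOTATIONS = [
    ("UniProtKB", "P04637", "GO:0006915"),
    ("Ensembl", "ENSG00000141510", "GO:0006915"),  # same term, other source
    ("Ensembl", "ENSG00000141510", "GO:0003677"),
    ("UniProtKB", "P38398", "GO:0006281"),
    ("ExampleDB", "X0001", "GO:0008150"),          # no mapping available
]


def unify_annotations(records, id_map):
    """Group GO terms by Entrez Gene ID; collect records that cannot be mapped."""
    unified = defaultdict(set)
    unmapped = []
    for source_db, native_id, go_term in records:
        entrez_id = id_map.get((source_db, native_id))
        if entrez_id is None:
            # Keep unmapped records for later inspection instead of dropping them.
            unmapped.append((source_db, native_id, go_term))
            continue
        # A set removes duplicates when two databases report the same term.
        unified[entrez_id].add(go_term)
    return dict(unified), unmapped


if __name__ == "__main__":
    unified, unmapped = unify_annotations(RAW_ANNOTATIONS, ID_MAP)
    for entrez_id, go_terms in sorted(unified.items()):
        print(entrez_id, sorted(go_terms))
    print("unmapped:", unmapped)
```

Storing the terms as a set per Entrez ID means that an annotation reported by several source databases is counted once, which is one simple way to handle the overlap that aggregation across databases produces.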

Abstract

Gene studies are crucial for fields such as protein structure prediction, drug discovery, and cancer genomics, yet they face challenges in fully utilizing the vast and diverse information available. They require clean, factual datasets to ensure reliable results. Ontology graphs, which are neatly organized graphs of domain terminology, are an ideal source of domain facts. However, available gene ontology annotations are currently distributed across various databases without unified identifiers for genes and gene products. To address these challenges, we introduce the Unified Entrez Gene Identifier Dataset and Benchmarks (UniEntrezDB), the first systematic effort to unify large-scale public Gene Ontology Annotations (GOA) from various databases using unique gene identifiers. UniEntrezDB includes a pre-training dataset and four downstream tasks designed to comprehensively evaluate gene embedding performance at the gene, protein, and cell levels, ultimately enhancing the reliability and applicability of LLMs in gene research and other professional settings.
