Mapping Biomedical Ontology Terms to IDs: Effect of Domain Prevalence on Prediction Accuracy

Thanh Son Do, Daniel B. Hier, Tayo Obafemi-Ajayi

TL;DR

This work evaluates GPT-4's ability to map biomedical ontology terms to IDs across HPO, GO, and UniProtKB and tests whether literature prevalence in the PMC corpus predicts mapping accuracy. The authors analyze correlations, binning, Zipf distributions, and ROC-based prevalence thresholds, comparing baseline accuracy with prevalence-informed Optimal models and using logistic regression to predict mapping probability from ID counts. They find that higher prevalence of IDs in literature predicts higher mapping accuracy for HPO IDs, GO IDs, and UniProtKB accession numbers, while mapping to HUGO symbols is not prevalence-dependent due to lexicalization. The results underscore limits of LLM-based ontology mapping in low-prevalence domains and suggest prevalence-aware evaluation and retrieval-augmented strategies, along with balanced test datasets for robust biomedical mapping.

Abstract

This study evaluates the ability of large language models (LLMs) to map biomedical ontology terms to their corresponding ontology IDs across the Human Phenotype Ontology (HPO), Gene Ontology (GO), and UniProtKB terminologies. Using counts of ontology IDs in the PubMed Central (PMC) dataset as a surrogate for their prevalence in the biomedical literature, we examined the relationship between ontology ID prevalence and mapping accuracy. Results indicate that ontology ID prevalence strongly predicts accurate mapping of HPO terms to HPO IDs, GO terms to GO IDs, and protein names to UniProtKB accession numbers. Higher prevalence of ontology IDs in the biomedical literature correlated with higher mapping accuracy. Predictive models based on receiver operating characteristic (ROC) curves confirmed this relationship. In contrast, this pattern did not apply to mapping protein names to Human Genome Organisation's (HUGO) gene symbols. GPT-4 achieved a high baseline performance (95%) in mapping protein names to HUGO gene symbols, with mapping accuracy unaffected by prevalence. We propose that the high prevalence of HUGO gene symbols in the literature has caused these symbols to become lexicalized, enabling GPT-4 to map protein names to HUGO gene symbols with high accuracy. These findings highlight the limitations of LLMs in mapping ontology terms to low-prevalence ontology IDs and underscore the importance of incorporating ontology ID prevalence into the training and evaluation of LLMs for biomedical applications.
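
To make the ROC-based prevalence threshold more concrete, the following is a minimal sketch of how such a cut-off could be chosen. It is not the authors' code. It assumes each term has a PMC ID count and a 0/1 flag for whether the mapping was correct, and it assumes the "optimal" threshold is the one that maximizes Youden's J (sensitivity + specificity - 1), which the source does not state explicitly. The function names and the toy data are hypothetical.

```python
from typing import List, Tuple


def roc_points(counts: List[int], correct: List[int]) -> List[Tuple[int, float, float]]:
    """Return (threshold, sensitivity, false_positive_rate) for every distinct count.

    A term is predicted "correctly mapped" when its ID count >= threshold.
    """
    positives = sum(correct)
    negatives = len(correct) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect mapping")
    points = []
    for t in sorted(set(counts)):
        tp = sum(1 for c, y in zip(counts, correct) if c >= t and y)
        fp = sum(1 for c, y in zip(counts, correct) if c >= t and not y)
        points.append((t, tp / positives, fp / negatives))
    return points


def best_threshold(counts: List[int], correct: List[int]) -> int:
    """Pick the count threshold maximizing Youden's J = sensitivity - FPR (assumed criterion)."""
    return max(roc_points(counts, correct), key=lambda p: p[1] - p[2])[0]


def auc(counts: List[int], correct: List[int]) -> float:
    """AUC as the probability that a correctly mapped term has a higher count
    than an incorrectly mapped one (ties count as one half)."""
    pos = [c for c, y in zip(counts, correct) if y]
    neg = [c for c, y in zip(counts, correct) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data, for illustration only (not taken from the paper).
toy_counts = [0, 1, 1, 2, 3, 6, 9, 15, 40, 120]
toy_correct = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
print(best_threshold(toy_counts, toy_correct), auc(toy_counts, toy_correct))
```

On real data, a library routine such as scikit-learn's `roc_curve` would normally replace the hand-written loop; the pure-Python version is used here only to keep the sketch self-contained.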

Paper Structure (4 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Workflow for Biomedical Term Normalization. The process extracts terms from clinical notes, standardizes them to ontology terms, and maps them to ontology identifiers (e.g., HP:0001265).
  • Figure 2: ROC Curve for Mapping GO Terms to GO IDs. Using 1,839 GO terms from the Cellular Component (CC) hierarchy, the ROC curve (orange line) was computed to evaluate mapping accuracy. An optimal threshold of 4 ID counts in the PMC dataset (red circle) maximized sensitivity and specificity, achieving an AUC of 0.90. Similar curves were created for the HPO and UniProtKB terminologies (not shown).
  • Figure 3: Accuracy of Mapping HPO Term to HPO ID Predicted by Logistic Regression. The log of the ID count in PMC (x-axis) was used as the predictor of accurate mapping (y-axis). Green markers represent correct mappings, and red markers represent incorrect mappings. The blue line shows the logistic regression fit, and the black dashed line marks the probability threshold of 0.5 used to classify mappings as likely accurate or not. Using this threshold, 57% of the 462 terms above the threshold were accurately mapped to their HPO IDs, while only 7% of the 18,338 terms below the threshold were accurately mapped. Markers are jittered at lower ID frequencies to enhance visibility on the left side of the plot. Similar models were constructed for the GO and UniProtKB terminologies (not shown).
  • Figure 4: Accuracy of Mapping HPO Terms to HPO IDs. GPT-4 mapped 18,880 terms to their HPO IDs. Terms were rank-ordered by the count of their ontology IDs in the PMC. Terms were divided into 20 equal-sized bins with Bin 1 containing the terms with the highest HPO ID counts in the PMC. For each bin, the mean accuracy (y1-axis) was calculated and the mean HPO ID count was calculated (y2-axis). Note that only terms in the highest ID count bin (Bin 1) were mapped to their HPO IDs accurately. The remaining 19 bins showed high error rates.
  • Figure 5: Accuracy of Mapping GO Concepts to GO ID. GPT-4 mapped 1,839 cellular component terms to their GO IDs with 35% accuracy. GO terms were ranked according to counts of their GO ID in the PMC. Accuracy and PMC GO ID counts declined steadily and in tandem from bin 1 (highest) to bin 20 (lowest).
  • ...and 3 more figures
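
The binning analysis described in the captions for Figures 4 and 5 (rank terms by PMC ID count, split them into equal-sized bins, then average accuracy and ID count per bin) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function name `bin_accuracy`, the input layout (parallel lists of counts and 0/1 correctness flags), and the toy data are all hypothetical.

```python
from typing import List, Tuple


def bin_accuracy(counts: List[int], correct: List[int], n_bins: int = 20) -> List[Tuple[int, float, float]]:
    """Rank terms by ID count (highest first) and summarize each bin.

    Returns one (bin_number, mean_accuracy, mean_id_count) tuple per bin,
    with bin 1 holding the terms with the highest ID counts.
    When len(counts) is not divisible by n_bins, bin sizes differ by at most one.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:  # more bins than terms: skip empty bins
            continue
        mean_acc = sum(correct[i] for i in idx) / len(idx)
        mean_count = sum(counts[i] for i in idx) / len(idx)
        rows.append((b + 1, mean_acc, mean_count))
    return rows


# Invented toy data, for illustration only (not taken from the paper).
toy_counts = [120, 40, 15, 9, 6, 3, 2, 1]
toy_correct = [1, 1, 1, 0, 1, 0, 0, 0]
for row in bin_accuracy(toy_counts, toy_correct, n_bins=4):
    print(row)
```

Plotting mean accuracy and mean ID count against bin number, on two y-axes, would give a chart of the kind the captions describe.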