Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers: Evidence Across Models and Ontologies

Daniel B. Hier, Steven Keith Platt, Tayo Obafemi-Ajayi

TL;DR

The paper investigates why large language models struggle to map biomedical terms to their correct ontology identifiers. Analyzing HPO and GO-CC with two strong LLMs (GPT-4o and LLaMa 3.1 405B) and nine engineered predictive features, it shows that familiarity signals from term and identifier usage in the literature best predict successful linking, while intrinsic term structure plays a minor role. The results reveal significant ontology deserts and a tokenizer-related Leading 000 artifact that differ by ontology, with GO-CC showing higher accuracy due to greater exposure. The findings suggest practical strategies, including fine-tuning, retrieval-augmented mechanisms, and improved literature reporting, for better ontology normalization and downstream biomedical information extraction and decision support.

Abstract

Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. We investigate why these failures occur by analyzing predictions across two major ontologies, Human Phenotype Ontology and Gene Ontology, and two high-performing models, GPT-4o and LLaMa 3.1 405B. We evaluate nine candidate features related to term familiarity, identifier usage, morphology, and ontology structure. Univariate and multivariate analyses show that exposure to ontology identifiers is the strongest predictor of linking success.

Paper Structure

This paper contains 4 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Univariate predictors of successful term-to-identifier linking for HPO. Bars show the difference in mean standardized (z-score) feature values between correctly and incorrectly linked terms. Features are sorted by absolute effect size, with the largest differences shown at the top. Positive bars (e.g., Annotation Count) indicate higher values for correctly linked terms, while negative bars (e.g., No Annotations) indicate higher values for incorrect links. Results are shown for GPT-4o (blue) and LLaMa 3.1 405B (orange), which demonstrate broadly similar patterns.
  • Figure 2: Univariate predictors of successful term-to-identifier linking for GO-CC. Bars show the difference in mean standardized (z-score) feature values between correctly and incorrectly linked terms. Features are sorted by absolute effect size, with the largest differences shown at the top. Positive bars (e.g., Annotation Count) indicate higher values for correctly linked terms, while negative bars (e.g., No Annotations) indicate higher values for incorrect links. Results are shown for GPT-4o (blue) and LLaMa 3.1 405B (orange), which demonstrate broadly similar patterns.
  • Figure 3: Standardized logistic regression coefficients for the models predicting successful linking of HPO terms to their identifiers, for LLaMa 3.1 405B and GPT-4o. Both models showed a similar pattern, with Annotation Count and Leading 000 as the largest positive coefficients and No Annotations as the largest negative coefficient.
  • Figure 4: Standardized logistic regression coefficients for the models predicting successful linking of GO terms to GO identifiers. Coefficients are ranked from largest to smallest. The LLaMa 3.1 405B and GPT-4o models showed a similar pattern, with PMC Identifiers and Annotation Count having the largest coefficients.
  • Figure 5: Annotation count predicts linking success for HPO terms. Results are shown for LLaMa 3.1 405B. Terms with no annotations (bin 0) have the highest failure rates. Note that the distribution of annotation counts across terms is Zipfian: many terms have few annotations (bin 0) and few terms have many (bins at the far right). Dark blue shading shows terms correctly linked to their identifiers; light blue shading shows failed links.
  • ...and 3 more figures
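
The univariate comparison described in the captions of Figures 1 and 2 (difference in mean z-scored feature value between correctly and incorrectly linked terms, sorted by absolute effect size) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `zscore_mean_difference`, the use of the population standard deviation for standardization, and all data values are assumptions made for the example. Only the feature names Annotation Count and No Annotations come from the captions.

```python
from statistics import mean, pstdev


def zscore_mean_difference(values, linked_correctly):
    """Mean z-score of correctly linked terms minus that of incorrectly linked terms.

    values           -- one raw feature value per ontology term
    linked_correctly -- one bool per term (True if the model returned the right ID)
    """
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return 0.0  # a constant feature cannot separate the two groups
    z = [(v - mu) / sigma for v in values]
    correct = [zi for zi, ok in zip(z, linked_correctly) if ok]
    incorrect = [zi for zi, ok in zip(z, linked_correctly) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect term")
    return mean(correct) - mean(incorrect)


# Made-up toy data for eight hypothetical terms (NOT results from the paper).
annotation_count = [0, 0, 1, 2, 5, 40, 120, 300]
no_annotations = [1 if c == 0 else 0 for c in annotation_count]
linked = [False, False, False, True, False, True, True, True]

effects = {
    "Annotation Count": zscore_mean_difference(annotation_count, linked),
    "No Annotations": zscore_mean_difference(no_annotations, linked),
}

# Sort by absolute effect size, largest first, as in the figure captions.
for name, diff in sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {diff:+.2f}")
```

On this toy input the sign convention matches the captions: a positive difference means the feature is higher for correctly linked terms, and a negative difference means it is higher for failed links.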