Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers: Evidence Across Models and Ontologies
Daniel B. Hier, Steven Keith Platt, Tayo Obafemi-Ajayi
TL;DR
The paper investigates why large language models struggle to map biomedical terms to their correct ontology identifiers. By analyzing HPO and GO-CC with two strong LLMs (GPT-4o and LLaMa 3.1 405B) and engineering nine predictive features, it shows that familiarity signals from term and identifier usage in the literature best predict successful linking, while intrinsic term structure plays a minor role. Results reveal significant ontology deserts and a tokenizer-related "Leading 000" artifact, both of which differ by ontology, with GO-CC showing higher accuracy due to greater exposure. The findings suggest practical strategies (fine-tuning, retrieval-augmented mechanisms, and improved literature reporting) to strengthen ontology normalization and downstream biomedical information extraction and decision support.
Abstract
Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. We investigate why these failures occur by analyzing predictions across two major ontologies, Human Phenotype Ontology and Gene Ontology, and two high-performing models, GPT-4o and LLaMa 3.1 405B. We evaluate nine candidate features related to term familiarity, identifier usage, morphology, and ontology structure. Univariate and multivariate analyses show that exposure to ontology identifiers is the strongest predictor of linking success.
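To make the linking task concrete, the following is a minimal sketch of how term-to-identifier linking could be scored by exact match. It is an illustration under stated assumptions, not the authors' evaluation code: the `GOLD` dictionary, the `ID_PATTERN` regular expression, and the function names `is_well_formed` and `linking_accuracy` are all hypothetical, and the three terms were chosen only as familiar examples of HPO and GO-CC entries.

```python
import re

# Hypothetical gold-standard sample (assumption: not the paper's dataset).
# HPO and GO identifiers use a prefix plus a zero-padded 7-digit number.
GOLD = {
    "Seizure": "HP:0001250",
    "nucleus": "GO:0005634",
    "mitochondrion": "GO:0005739",
}

# Assumed format check for HPO/GO identifiers.
ID_PATTERN = re.compile(r"^(HP|GO):\d{7}$")


def is_well_formed(identifier):
    """Return True if the string looks like an HPO or GO identifier."""
    return bool(ID_PATTERN.match(identifier.strip()))


def linking_accuracy(predictions, gold=GOLD):
    """Fraction of gold terms whose predicted identifier matches exactly.

    `predictions` maps a term to the identifier a model returned for it;
    a missing term counts as a failure.
    """
    correct = sum(
        1 for term, gold_id in gold.items()
        if predictions.get(term, "").strip() == gold_id
    )
    return correct / len(gold)
```

A prediction counts as correct only when the whole identifier matches, so an output that is well formed but points to a different entry still counts as a linking failure. Per-term outcomes of this kind are what candidate features such as term familiarity or identifier usage would then be used to predict.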
