Hierarchical Retrieval with Out-Of-Vocabulary Queries: A Case Study on SNOMED CT
Jonathon Dilworth, Hui Yang, Jiaoyan Chen, Yongsheng Gao
TL;DR
This work tackles the retrieval of SNOMED CT concepts for out-of-vocabulary (OOV) queries by framing the task as hierarchical retrieval (HR). It uses language-model-based ontology embeddings in hyperbolic space, HiT and OnT, to infer subsumption relationships and rank candidates with a depth-biased score. In experiments on an OOV query dataset derived from MIRAGE, the OnT model consistently outperforms the lexical baselines and SBERT, with strong gains as the permissible hop distance $d$ increases. The authors describe the approach as generalisable to other ontologies, with potential benefits for clinical decision support and terminology navigation, and they release code, tools, and data for reuse.
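To make the idea of a depth-biased subsumption score concrete, the following is a minimal sketch rather than the authors' implementation. It assumes embeddings in a Poincaré ball in which more general concepts lie closer to the origin, and it assumes a score that combines hyperbolic distance with a weighted difference of hyperbolic norms. The function names, the weight `lam`, the curvature `c`, and the 2-D toy vectors and concept labels are all hypothetical.

```python
import math


def _sq_norm(x):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(xi * xi for xi in x)


def poincare_distance(u, v, c=1.0):
    """Geodesic distance in a Poincare ball of curvature -c."""
    diff = [a - b for a, b in zip(u, v)]
    denom = (1.0 - c * _sq_norm(u)) * (1.0 - c * _sq_norm(v))
    arg = 1.0 + 2.0 * c * _sq_norm(diff) / denom
    # Clamp to guard against arg dipping below 1 through rounding.
    return math.acosh(max(arg, 1.0)) / math.sqrt(c)


def hyperbolic_norm(u, c=1.0):
    """Hyperbolic distance from the origin, used here as a proxy for depth."""
    return poincare_distance([0.0] * len(u), u, c)


def subsumption_score(child, parent, lam=0.5, c=1.0):
    """Assumed score form: higher means `parent` more plausibly subsumes `child`.

    The second term is the depth bias: it rewards candidates that sit
    closer to the origin (more general) than the child.
    """
    depth_gap = hyperbolic_norm(parent, c) - hyperbolic_norm(child, c)
    return -(poincare_distance(child, parent, c) + lam * depth_gap)


def rank_candidates(query, concepts, lam=0.5, c=1.0):
    """Return (label, score) pairs sorted from best to worst candidate subsumer."""
    scored = [(label, subsumption_score(query, emb, lam, c))
              for label, emb in concepts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-D embeddings; real models use far higher dimensions.
concepts = {
    "Disease": [0.1, 0.0],
    "Heart disease": [0.4, 0.0],
    "Bone fracture": [0.0, 0.5],
}
query = [0.6, 0.05]  # stands in for an encoded OOV query

ranked = rank_candidates(query, concepts)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

In this toy setup, `lam` controls how strongly the ranking favours general concepts: a larger value pushes ancestors near the origin up the list, while a smaller value favours candidates that are simply close to the query.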
Abstract
SNOMED CT is a biomedical ontology that organises a large number of concepts into a hierarchy. Knowledge retrieval in SNOMED CT is critical to its application, but it is often challenging because of language ambiguity, synonymy, polysemy, and similar phenomena. The problem is exacerbated when queries are out-of-vocabulary (OOV), i.e., when they have no equivalent match in the ontology. In this work, we focus on the problem of hierarchical concept retrieval from SNOMED CT with OOV queries, and propose an approach based on language-model-based ontology embeddings. For evaluation, we construct OOV queries annotated against SNOMED CT concepts, testing the retrieval of the most direct subsumers and of their more distant, less relevant ancestors. We find that our method outperforms the baselines, which include SBERT and two lexical matching methods. Although evaluated on SNOMED CT, the approach is generalisable and can be extended to other ontologies. We release code, tools, and evaluation datasets at https://github.com/jonathondilworth/HR-OOV.
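The evaluation targets described above (the most direct subsumers plus their ancestors) can be illustrated with a short sketch. It assumes that the target set for a query is built by walking at most a fixed number of is-a hops upward from the annotated direct subsumers. The hierarchy, the concept labels, the function names, and the hit-style check are hypothetical and are not taken from the released code or data.

```python
from collections import deque


def targets_within_hops(parents, direct_subsumers, max_hops):
    """Collect the direct subsumers and every ancestor within `max_hops` is-a hops.

    `parents` maps a concept to its list of direct parents, so a concept
    may have several parents (a polyhierarchy).
    """
    seen = set(direct_subsumers)
    frontier = deque((concept, 0) for concept in direct_subsumers)
    while frontier:
        concept, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the permitted distance
        for parent in parents.get(concept, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, hops + 1))
    return seen


def hit_at_k(ranked, targets, k):
    """Hypothetical check: is any target among the top-k ranked concepts?"""
    return any(concept in targets for concept in ranked[:k])


# Toy is-a hierarchy with made-up labels (not real SNOMED CT identifiers).
parents = {
    "Viral pneumonia": ["Pneumonia", "Viral infection"],
    "Pneumonia": ["Lung disease"],
    "Viral infection": ["Infectious disease"],
    "Lung disease": ["Disease"],
    "Infectious disease": ["Disease"],
}
direct = ["Viral pneumonia"]  # annotated most direct subsumer of an OOV query

for d in range(3):
    print(d, sorted(targets_within_hops(parents, direct, d)))

# A ranking that misses the direct subsumer but finds a nearby ancestor.
ranking = ["Pneumonia", "Lung disease", "Disease"]
print(hit_at_k(ranking, targets_within_hops(parents, direct, 0), k=1))
print(hit_at_k(ranking, targets_within_hops(parents, direct, 1), k=1))
```

Under these assumptions, a larger hop limit widens the set of acceptable answers, so a system that retrieves a close ancestor instead of the most direct subsumer is credited only once the limit reaches that ancestor.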
