Ontology-based Semantic Similarity Measures for Clustering Medical Concepts in Drug Safety
Jeffery L Painter, François Haguinet, Gregory E Powell, Andrew Bate
TL;DR
This work aims to improve pharmacovigilance by clustering MedDRA adverse-event terms using ontology-based semantic similarity measures (SSMs). It benchmarks six SSMs over a MedDRA and SNOMED-CT ontology network accessed via UMLS, and delivers a scalable, high-throughput framework with cross-language interfaces. The main finding is that intrinsic information content-based measures, particularly INTRINSIC_LIN, INTRINSIC_LCH, and SOKAL, achieve the best clustering accuracy (F1 around 0.403–0.404) and outperform path-based methods when validated against expert review and SMQs. The approach offers a practical, ontology-grounded enhancement to automated safety signal detection, potentially reducing the manual review burden in pharmacovigilance.
Abstract
Semantic similarity measures (SSMs) are widely used in biomedical research but remain underutilized in pharmacovigilance. This study evaluates six ontology-based SSMs for clustering MedDRA Preferred Terms (PTs) in drug safety data. Using the Unified Medical Language System (UMLS), we assess each method's ability to group PTs around medically meaningful centroids. A high-throughput framework was developed with a Java API and Python and R interfaces to support large-scale similarity computations. Results show that while path-based methods perform moderately, with F1 scores of 0.36 for WUPALMER and 0.28 for LCH, intrinsic information content (IC)-based measures, especially INTRINSIC-LIN and SOKAL, consistently yield better clustering accuracy (F1 score of 0.403). Validated against expert review and Standardised MedDRA Queries (SMQs), our findings highlight the promise of IC-based SSMs in enhancing pharmacovigilance workflows by improving early signal detection and reducing manual review.
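To make the contrast between path-based and intrinsic IC-based measures concrete, the following is a minimal Python sketch on a toy hierarchy. It is not the framework's API, and the abstract does not give the exact formulas behind INTRINSIC-LIN or WUPALMER. The sketch assumes the textbook Wu-Palmer and Lin definitions and one common intrinsic IC formulation (Seco et al., based on descendant counts). The hierarchy, the function names, and the threshold are hypothetical, and the toy is a single-parent tree, whereas a MedDRA and SNOMED-CT network allows multiple parents.

```python
import math

# Hypothetical toy is-a hierarchy (child -> parent). Not real MedDRA/SNOMED-CT content.
PARENT = {
    "cardiac disorder": "disorder",
    "hepatic disorder": "disorder",
    "arrhythmia": "cardiac disorder",
    "myocardial infarction": "cardiac disorder",
    "tachycardia": "arrhythmia",
    "bradycardia": "arrhythmia",
    "hepatitis": "hepatic disorder",
}
CONCEPTS = set(PARENT) | set(PARENT.values())


def ancestors(concept):
    """Path from the concept up to the root, including the concept itself."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def depth(concept):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(ancestors(concept))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):  # walks upward, so the first hit is the deepest
        if candidate in seen:
            return candidate
    raise ValueError("concepts share no ancestor")


def intrinsic_ic(concept):
    """Assumed intrinsic IC: 1 - log(descendants + 1) / log(total concepts)."""
    n_desc = sum(1 for c in CONCEPTS if c != concept and concept in ancestors(c))
    return 1.0 - math.log(n_desc + 1) / math.log(len(CONCEPTS))


def wu_palmer(a, b):
    """Path-based: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


def intrinsic_lin(a, b):
    """IC-based: 2 * IC(LCS) / (IC(a) + IC(b)), with IC taken from the hierarchy."""
    if a == b:
        return 1.0
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2 * intrinsic_ic(lcs(a, b)) / denom if denom > 0 else 0.0


def cluster_around(centroid, terms, measure, threshold):
    """Keep the terms whose similarity to the centroid meets the threshold."""
    return sorted(t for t in terms if measure(centroid, t) >= threshold)


if __name__ == "__main__":
    terms = ["bradycardia", "myocardial infarction", "hepatitis"]
    for t in terms:
        print(t, round(wu_palmer("tachycardia", t), 3),
              round(intrinsic_lin("tachycardia", t), 3))
    print(cluster_around("tachycardia", terms, intrinsic_lin, 0.2))
```

In this toy, both measures rank bradycardia closest to tachycardia and hepatitis farthest, but the IC-based score drops to zero when the only shared ancestor is the uninformative root, while the path-based score stays positive. Whether that behavior explains the F1 gap reported above is not something this sketch can establish.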
