Automatic detection of diseases in Spanish clinical notes combining medical language models and ontologies
Leon-Paul Schaub Torre, Pelayo Quiros, Helena Garcia Mieres
TL;DR
The paper tackles automatic disease detection in Spanish dermatology EHR notes with a hybrid method that combines a Spanish medical language model with medical ontologies in cascade models. It introduces a new anonymized dataset of 8881 dermatology reports covering 173 pathologies. The approach first learns intermediate semantic aspects (site, type, severity) and then predicts the exact pathology. Ontology-enhanced cascade models achieve state-of-the-art results (precision ≈ 0.84, micro F1 ≈ 0.82, macro F1 ≈ 0.75, top-2 accuracy ≈ 0.92), significantly outperforming vanilla baselines and demonstrating the value of external knowledge in low-resource languages. The work releases the dataset publicly and outlines future directions that incorporate retrieval-augmented techniques and NER-inspired negation handling to further improve performance and generalization.
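To make the reported metrics concrete, the following is a minimal sketch of how micro F1, macro F1, and top-k accuracy can be computed for single-label classification. The function names and the toy labels are assumptions made for illustration; they are not the paper's evaluation code or data. With many classes of unequal frequency, macro F1 weights every class equally, so errors on rare pathologies pull it below micro F1.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def f1_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    # Macro: unweighted mean of per-class F1.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 over counts pooled across all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


def top_k_accuracy(y_true: Sequence[str], ranked: Sequence[List[str]], k: int = 2) -> float:
    """Fraction of examples whose true label is among the k highest-ranked predictions."""
    hits = sum(truth in preds[:k] for truth, preds in zip(y_true, ranked))
    return hits / len(y_true)


# Toy, made-up labels: one rare class that the model always misses.
y_true = ["acne", "acne", "acne", "acne", "melanoma"]
y_pred = ["acne", "acne", "acne", "acne", "acne"]
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))
```

In this toy case the single missed rare class lowers macro F1 far more than micro F1, which is the same direction as the gap between the reported 0.82 and 0.75.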
Abstract
In this paper we present a hybrid method for the automatic detection of dermatological pathologies in medical reports. We use a large language model combined with medical ontologies to predict, given a first-appointment or follow-up medical report, the pathology a person may suffer from. The results show that teaching the model to learn the type, severity, and body location of a dermatological pathology, as well as the order in which it learns these three features, significantly increases its accuracy. We demonstrate state-of-the-art results for the classification of medical texts, with a precision of 0.84 and micro and macro F1-scores of 0.82 and 0.75, respectively, and make both the method and the dataset available to the community.
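The cascade idea described above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the ontology entries, the keyword-based stand-in classifiers, the `[aspect=label]` tagging scheme, and the stage order (type, site, severity) are all hypothetical. In the paper's setting each stage would be a fine-tuned language model, and the best stage order is something the authors determine experimentally.

```python
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[str], str]

# Hypothetical ontology excerpt mapping a pathology to its intermediate aspects.
# The entries are invented for illustration and do not come from the paper's data.
ONTOLOGY: Dict[str, Dict[str, str]] = {
    "acne": {"type": "inflammatory", "site": "face", "severity": "mild"},
    "melanoma": {"type": "tumoral", "site": "skin", "severity": "severe"},
}


def aspect_labels(pathology: str, order: List[str]) -> List[str]:
    """Derive intermediate training targets for a report from its pathology label."""
    return [ONTOLOGY[pathology][aspect] for aspect in order]


def keyword_classifier(rules: Dict[str, str], default: str) -> Classifier:
    """Stand-in for a trained model: return the label of the first matching keyword."""
    def predict(text: str) -> str:
        lowered = text.lower()
        for keyword, label in rules.items():
            if keyword in lowered:
                return label
        return default
    return predict


def run_cascade(text: str, stages: List[Tuple[str, Classifier]], final: Classifier) -> Dict[str, str]:
    """Predict each aspect in order, appending it to the input of the next stage."""
    context = text
    output: Dict[str, str] = {}
    for name, classifier in stages:
        label = classifier(context)
        output[name] = label
        context = f"{context} [{name}={label}]"  # later stages see earlier predictions
    output["pathology"] = final(context)
    return output


stages = [
    ("type", keyword_classifier({"pústula": "inflammatory", "pigment": "tumoral"}, "unknown")),
    ("site", keyword_classifier({"cara": "face", "espalda": "back"}, "unknown")),
    ("severity", keyword_classifier({"leve": "mild", "grave": "severe"}, "unknown")),
]
final = keyword_classifier(
    {"[type=inflammatory]": "acne", "[type=tumoral]": "melanoma"}, "unknown"
)

result = run_cascade("Paciente con pústulas leves en la cara", stages, final)
print(result)
```

Passing earlier predictions forward as text tags is only one possible way to chain the stages; it is used here because it keeps the sketch independent of any particular model library.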
