Automatic detection of diseases in Spanish clinical notes combining medical language models and ontologies

Leon-Paul Schaub Torre, Pelayo Quiros, Helena Garcia Mieres

TL;DR

The paper tackles automatic disease detection from Spanish dermatology EHR notes by proposing a hybrid method that fuses a Spanish medical language model with medical ontologies in cascade models. A new anonymized dataset of 8881 dermatology reports across 173 pathologies is introduced, and the approach learns intermediate semantic aspects (site, type, severity) before predicting the exact pathology. Ontology-enhanced cascade models achieve state-of-the-art results (precision ≈ 0.84, micro F1 ≈ 0.82, macro F1 ≈ 0.75; top-2 accuracy ≈ 0.92), significantly outperforming vanilla baselines, and demonstrating the utility of external knowledge in low-resource languages. The work provides a public dataset and outlines future directions incorporating retrieval-augmented techniques and NER-inspired negation handling to further improve performance and generalization.

Abstract

In this paper we present a hybrid method for the automatic detection of dermatological pathologies in medical reports. We use a large language model combined with medical ontologies to predict, given a first appointment or follow-up medical report, the pathology a person may suffer from. The results show that teaching the model to learn the type, severity and body location of a dermatological pathology, as well as the order in which it learns these three features, significantly increases its accuracy. The article demonstrates state-of-the-art results for the classification of medical texts, with a precision of 0.84 and micro and macro F1-scores of 0.82 and 0.75 respectively, and makes both the method and the dataset available to the community.
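The gap between the micro and macro F1-scores reported above (0.82 versus 0.75) is typical of a dataset with many rare classes: micro F1 pools every prediction, while macro F1 averages the per-class scores so that rare pathologies weigh as much as common ones. The following is a minimal sketch of how such metrics can be computed for single-label classification. The function names, the toy labels and the toy predictions are illustrative assumptions and do not come from the paper, whose exact evaluation code is not shown in this excerpt.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


def top_k_accuracy(y_true, ranked_preds, k=2):
    """Fraction of samples whose true label is among the k best-ranked guesses."""
    hits = sum(truth in ranked[:k] for truth, ranked in zip(y_true, ranked_preds))
    return hits / len(y_true)


# Hypothetical toy data: one frequent class and two rare ones.
y_true = ["acne", "acne", "acne", "psoriasis", "eczema"]
y_pred = ["acne", "acne", "acne", "eczema", "eczema"]
ranked = [
    ["acne", "eczema"],
    ["acne", "psoriasis"],
    ["eczema", "acne"],
    ["eczema", "acne"],
    ["eczema", "acne"],
]

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
print(f"top-2 accuracy = {top_k_accuracy(y_true, ranked, k=2):.3f}")
```

In this toy run the single misclassified rare class lowers macro F1 far more than micro F1, which is the same pattern the reported scores suggest. Note that in a single-label setting micro F1 coincides with plain accuracy.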

Paper Structure

This paper contains 26 sections, 2 equations, 6 figures, 8 tables, 2 algorithms.

Figures (6)

  • Figure 1: Dataset example. On the left, the first consultation or follow-up report. On the right, the pathology to be predicted.
  • Figure 2: Graphical representation of the partition generated for the validation of the anonymization performed
  • Figure 3: Architecture of our method (in red the training-only stages, in green the training and inference stages)
  • Figure 4: Distribution of diseases in the generated dataset
  • Figure 5: Confusion Matrix for model A
  • ...and 1 more figure