ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP
Guillem García Subies, Álvaro Barbero Jiménez, Paloma Martínez Fernández
TL;DR
The paper tackles the scarcity of open Spanish clinical data by introducing ClinText-SP, the largest open Spanish clinical corpus, and a domain-adapted encoder, RigoBERTa Clinical. It constructs ClinText-SP from journals, annotated corpora, and supplementary sources, then continues pretraining RigoBERTa 2 on this corpus to yield a specialized clinical Spanish model. Empirical evaluations across Spanish clinical benchmarks show that RigoBERTa Clinical consistently surpasses prior Spanish and multilingual encoders, with an average micro-F1 improvement of around 0.01. By publicly releasing both the dataset and the model, the work aims to accelerate research and real-world deployment in Spanish clinical NLP, while outlining future work on efficiency, architectural refinements, and multimodal data integration.
Abstract
We present a novel contribution to Spanish clinical natural language processing by introducing the largest publicly available Spanish clinical corpus, ClinText-SP, along with a state-of-the-art clinical encoder language model, RigoBERTa Clinical. Our corpus was meticulously curated from diverse open sources, including clinical cases from medical journals and annotated corpora from shared tasks, providing a rich and diverse dataset that was previously difficult to access. RigoBERTa Clinical, developed through domain-adaptive pretraining on this comprehensive dataset, significantly outperforms existing models on multiple clinical NLP benchmarks. By publicly releasing both the dataset and the model, we aim to empower the research community with robust resources that can drive further advancements in clinical NLP and ultimately contribute to improved healthcare applications.
