Language Resources in Spanish for Automatic Text Simplification across Domains
Antonio Moreno-Sandoval, Leonardo Campillos-Llanos, Ana García-Serrano
TL;DR
The paper addresses automatic text simplification for Spanish in three domains (Finance, Medicine, and History) by building domain-specific corpora, annotation and simplification guidelines, and lexicons, along with deep learning (DL) models and two end-to-end simplification tools. The resources it reports include CLARA-DM, FinT-esp2020, CLARA-MeD, SimpFin, and SimpMedLexSp, and it shares datasets used in shared tasks such as FinTOC, FNS, and FinCausal. The findings show that lexicon-enhanced DL models improve performance, but risky outputs such as hallucinations remain an issue, which underscores the need for expert supervision and retrieval-augmented approaches. Overall, the work provides publicly available resources and benchmarks that support safer, domain-aware text simplification for Spanish and sets the stage for future improvements in accuracy and reliability.
Abstract
This work describes the language resources and models developed for automatic simplification of Spanish texts in three domains: Finance, Medicine and History studies. We created several corpora in each domain, annotation and simplification guidelines, a lexicon of technical and simplified medical terms, datasets used in shared tasks for the financial domain, and two simplification tools. The methodology, resources and companion publications are publicly available on the website: https://clara-nlp.uned.es/.
