Language Resources in Spanish for Automatic Text Simplification across Domains

Antonio Moreno-Sandoval, Leonardo Campillos-Llanos, Ana García-Serrano

TL;DR

The paper addresses automatic text simplification for Spanish across three domains (Finance, Medicine, and History) by building domain-specific corpora, annotation and simplification guidelines, and lexicons, together with deep learning (DL) models and two end-to-end simplification tools. It reports resources such as CLARA-DM, FinT-esp2020, and CLARA-MeD, along with SimpFin and SimpMedLexSp, and shares datasets used in shared tasks such as FinTOC, FNS, and FinCausal. The findings show that lexicon-enhanced DL models improve performance, but risky outputs such as hallucinations remain an issue, which underscores the need for expert supervision and retrieval-augmented approaches. Overall, the work provides publicly available resources and benchmarks that support safer, domain-aware text simplification for Spanish and sets the stage for future improvements in accuracy and reliability.

Abstract

This work describes the language resources and models developed for automatic simplification of Spanish texts in three domains: Finance, Medicine and History studies. We created several corpora in each domain, annotation and simplification guidelines, a lexicon of technical and simplified medical terms, datasets used in shared tasks for the financial domain, and two simplification tools. The methodology, resources and companion publications are shared publicly on the website: https://clara-nlp.uned.es/.

Paper Structure

This paper contains 17 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Overview of CLARA-NLP.
  • Figure 2: Screenshot of a manually annotated text in the Tagtog tool; image from Salido et al. (2023).
  • Figure 3: Simplification tools for the financial domain (left) and for the medical domain (right).