Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations
Stefan Bott, Horacio Saggion, Nelson Pérez Rojas, Martin Solis Salazar, Saul Calderon Ramirez
TL;DR
The paper addresses the lack of high-quality lexical simplification (LS) and lexical complexity prediction (LCP) datasets for Spanish and, especially, Catalan. It describes parallel data-collection pipelines that produce two new resources: a Catalan LS dataset curated from educational texts, and a Spanish LS/LCP resource with scalar complexity ratings curated from a sentence-simplification corpus. The authors assess data quality through inter-annotator agreement and in-context substitution analyses, and they discuss ethical considerations associated with the substitutions and with potential biases. The datasets provide benchmarks for LS and LCP in Iberian Romance languages and lay the groundwork for future shared tasks and for improved ethical data practices.
Abstract
Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper describes and analyzes two novel datasets for lexical simplification, one in Spanish and one in Catalan. The Catalan dataset is the first of its kind, and the Spanish dataset is a substantial addition to the sparse data available for automatic lexical simplification in Spanish. Specifically, it is the first Spanish dataset to include scalar ratings of how difficult lexical items are to understand. In addition, we present a detailed analysis that assesses the appropriateness and ethical dimensions of the data for the lexical simplification task.
