SiTSE: Sinhala Text Simplification Dataset and Evaluation

Surangika Ranathunga, Rumesh Sirithunga, Himashi Rathnayake, Lahiru De Silva, Thamindu Aluthwala, Saman Peramuna, Ravi Shekhar

TL;DR

This work addresses the scarcity of sentence-level text simplification (TS) resources for Sinhala by introducing SiTSE, a dataset of 1,000 complex Sinhala sentences and 3,000 manually simplified references. It frames Sinhala TS as a zero-resource, zero-shot problem and systematically evaluates the multilingual sequence-to-sequence pre-trained language models mBART and mT5 using intermediate task transfer learning (ITTL) with auxiliary tasks (translation, paraphrasing, and English simplification). Using automatic metrics (SARI, BERTScore) and human evaluation, the study shows that ITTL outperforms prior zero-resource baselines and that translation-first ITTL provides strong gains, though no single metric fully captures quality. The work highlights practical implications for low-resource languages, underscores the need for richer evaluation frameworks, and outlines future work on scaling the data and exploring multi-task learning and LLMs.

Abstract

Text simplification has been minimally explored for low-resource languages. Consequently, only a few manually curated datasets exist. In this paper, we present a human-curated sentence-level text simplification dataset for the Sinhala language. Our evaluation dataset contains 1,000 complex sentences and 3,000 corresponding simplified sentences produced by three different human annotators. We model text simplification as a zero-shot, zero-resource sequence-to-sequence (seq-seq) task on the multilingual language models mT5 and mBART. We exploit auxiliary data from related seq-seq tasks and explore the possibility of using intermediate task transfer learning (ITTL). Our analysis shows that ITTL outperforms the previously proposed zero-resource methods for text simplification. Our findings also highlight the challenges in evaluating text simplification systems and support the calls for improved metrics, also suited to low-resource languages, for measuring the quality of automated text simplification systems. Our code and data are publicly available: https://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-and-Evaluation

Paper Structure

This paper contains 34 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: A sample sentence from the SiTSE dataset with English translations
  • Figure 2: ITTL with (a) a single intermediate task and (b) a sequence of intermediate tasks.
  • Figure 3: Visual representation of the number of occurrences of each error type that exist in model outputs and human-curated data.
  • Figure 4: Sample sentences generated from three models