Simple is not Enough: Document-level Text Simplification using Readability and Coherence

Laura Vásquez-Rodríguez, Nhung T. H. Nguyen, Piotr Przybyła, Matthew Shardlow, Sophia Ananiadou

TL;DR

This work tackles document-level text simplification (TS) by jointly optimizing simplicity, readability, and coherence. It uses a T5-based text-to-text framework with control tokens (e.g., "simplify:" and "read classify:") and adds a coherence reward to guide generation; the resulting models are evaluated in zero-shot, few-shot, and fine-tuning settings. Key contributions include adapting professionally annotated corpora for multi-task training, proposing a joint loss that combines simplification, readability, and coherence, and evaluating across multiple datasets (Newsela, D-Wikipedia, GCDC) to analyze the benefits and limitations of incorporating coherence into document-level TS. The findings show that multi-task learning and document-level pretraining can improve readability and, in some settings, document-level simplification, but fluency and coherence remain challenging, which points to a need for larger, more diverse coherence resources and further methodological refinements. Overall, the paper advances the integration of discourse-level factors into document-level TS and highlights practical considerations for achieving controllable, audience-tailored simplifications.

Abstract

In this paper, we present the SimDoc system, a simplification model considering simplicity, readability, and discourse aspects, such as coherence. In the past decade, the progress of the Text Simplification (TS) field has been mostly shown at a sentence level, rather than considering paragraphs or documents, a setting from which most TS audiences would benefit. We propose a simplification system that is initially fine-tuned with professionally created corpora. Further, we include multiple objectives during training, considering simplicity, readability, and coherence altogether. Our contributions include the extension of professionally annotated simplification corpora by the association of existing annotations into (complex text, simple text, readability label) triples to benefit from readability during training. Also, we present a comparative analysis in which we evaluate our proposed models in a zero-shot, few-shot, and fine-tuning setting using document-level TS corpora, demonstrating novel methods for simplification. Finally, we show a detailed analysis of outputs, highlighting the difficulties of simplification at a document level.

Paper Structure

This paper contains 41 sections, 1 equation, 3 figures, 8 tables.

Figures (3)

  • Figure 1: TS model architecture. The complex text (for TS) and the simple gold-standard text (for readability) are input into the model. Predicted simplifications are used in the TS loss and in the coherence evaluation. The predicted readability labels are used for the readability loss. Finally, we propose a combined loss using TS, readability, and coherence.
  • Figure 2: SBERT-based model architecture
  • Figure 3: SetFit model architecture (Tunstall et al., 2022). ST stands for Sentence Transformers.
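
The multi-task setup summarized in the abstract and in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the `Triple` class, the function names, the loss weights, and the conversion of a coherence score into a penalty are all assumptions made for illustration. Only the two control tokens ("simplify:" and "read classify:"), the (complex text, simple text, readability label) triple, and the idea of combining TS, readability, and coherence terms come from the text above.

```python
from dataclasses import dataclass

# Control tokens quoted in the TL;DR; the trailing space is an assumption.
SIMPLIFY_PREFIX = "simplify: "
READABILITY_PREFIX = "read classify: "


@dataclass
class Triple:
    """Hypothetical container for one (complex, simple, readability label) triple."""
    complex_text: str
    simple_text: str
    readability_label: str


def to_text_to_text(triple):
    """Turn one triple into two (input, target) pairs, one per task.

    Following the Figure 1 caption, the complex text is the input for
    simplification and the simple gold-standard text is the input for
    readability classification.
    """
    return [
        (SIMPLIFY_PREFIX + triple.complex_text, triple.simple_text),
        (READABILITY_PREFIX + triple.simple_text, triple.readability_label),
    ]


def combined_loss(ts_loss, readability_loss, coherence_score,
                  w_ts=1.0, w_read=1.0, w_coh=1.0):
    """Weighted sum of the three training signals (assumed form).

    coherence_score is assumed to lie in [0, 1], higher meaning more
    coherent, so (1 - coherence_score) acts as a penalty. The weights
    are hypothetical hyperparameters, not values from the paper.
    """
    coherence_penalty = 1.0 - coherence_score
    return w_ts * ts_loss + w_read * readability_loss + w_coh * coherence_penalty


# Usage with made-up text and a made-up label.
example = Triple(
    complex_text="The committee deliberated at length before reaching a verdict.",
    simple_text="The group talked for a long time. Then they decided.",
    readability_label="grade 3",
)
pairs = to_text_to_text(example)
loss = combined_loss(ts_loss=2.0, readability_loss=0.5, coherence_score=0.75)
```

Sharing one text-to-text format across both tasks is what lets a single T5-style model be trained on simplification and readability classification together; how the paper actually weights or schedules the three terms is defined by its own equation, which is not reproduced here.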