Readability Measures and Automatic Text Simplification: In the Search of a Construct
Rémi Cardon, A. Seza Doğruöz
TL;DR
This study interrogates how standard readability measures relate to automatic evaluation metrics and human judgments in Automatic Text Simplification (ATS). Using SimplicityDA (sentence-level) and D-Wikipedia (document-level) data, the authors combine 1,066 readability features with traditional metrics to assess correlations with human judgments (fluency, simplicity, meaning preservation) and with ATS metrics (BLEU, SARI, BERTScore, D-SARI, LENS). They find generally weak correlations between readability measures and both human judgments and automatic metrics, with somewhat stronger associations for delta-based comparisons and meaning preservation, and notably stronger signals for BERTScore on sentence-level data. The results highlight a disconnect among the three axes used to evaluate simplification, underscoring the need for a well-defined ATS construct and more rigorous alignment between readability assessment and ATS evaluation. Limitations include data volume and language scope, suggesting that broader datasets and standardized evaluation protocols are needed for robust conclusions and practical impact.
Abstract
Readability is a key concept in the current era of abundant written information. To help make texts more readable and information more accessible to everyone, a line of research aims at making texts accessible for their target audience: automatic text simplification (ATS). Lately, there have been studies on the correlations between automatic evaluation metrics in ATS and human judgment. However, the correlations between those two aspects and commonly available readability measures (such as readability formulas or linguistic features) have received less attention. In this work, we investigate the place of readability measures in ATS for English, complementing the existing studies on evaluation metrics and human judgment. We first discuss the relationship between ATS and research in readability, then we report a study on correlations between readability measures and human judgment, and between readability measures and ATS evaluation metrics. We find that, in general, readability measures do not correlate well with automatic metrics and human judgment. We argue that, since the three angles from which simplification can be assessed tend to exhibit rather low correlations with one another, there is a need for a clear definition of the construct in ATS.
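To make the kind of analysis described above concrete, the following is a minimal sketch of correlating a readability measure with human judgment. It rests on several assumptions that do not come from the paper: Flesch Reading Ease stands in for the much larger feature set, Spearman's rank correlation stands in for whichever coefficient the authors used, the syllable counter is a rough heuristic, and the sentence pairs and ratings are invented toy data rather than SimplicityDA items.

```python
import re
from statistics import mean

def count_syllables(word):
    # Rough heuristic (an assumption): one syllable per group of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease: higher scores indicate easier text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def ranks(values):
    # 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    # Assumes neither input is constant (non-zero variance).
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

# Hypothetical (source, simplified) pairs; not taken from any dataset.
pairs = [
    ("The committee subsequently postponed the deliberation indefinitely.",
     "The group put off the talk."),
    ("Precipitation is anticipated throughout the afternoon.",
     "Rain is expected this afternoon."),
    ("He purchased an automobile.",
     "He purchased a vehicle."),
]
# Hypothetical human simplicity ratings, one per pair.
human_simplicity = [4.5, 3.5, 1.0]

# Delta-based measure: readability of the output minus that of the source.
deltas = [flesch_reading_ease(simp) - flesch_reading_ease(src)
          for src, simp in pairs]
rho = spearman(deltas, human_simplicity)
print("Readability deltas:", [round(d, 2) for d in deltas])
print("Spearman correlation with human ratings:", round(rho, 3))
```

With only three toy pairs the resulting coefficient carries no statistical weight; the sketch only illustrates the shape of the computation, in which a delta between source and output readability is compared against human ratings.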
