Readability Formulas, Systems and LLMs are Poor Predictors of Reading Ease
Keren Gruteke Klein, Shachar Frenkel, Omer Shubi, Yevgeni Berzak
TL;DR
This study questions how well widely used readability formulas, NLP-based metrics, frontier LLMs, and commercial educational tools predict real-time reading ease. It introduces a content-controlled evaluation framework that uses eye-tracking data from a large English reading corpus (OneStopL1&L2) to measure the difference in reading ease between original and simplified versions of the same text, $\Delta\text{ReadingEase}_{E,T}$. The framework then assesses how well the corresponding difference in a method's readability score, $\Delta\text{Score}_{M,T}$, predicts that difference, using a Pearson correlation: $\text{Eval}_M = \text{Pearson}_{corr}(\Delta\text{Score}_{M,T}, \Delta\text{ReadingEase}_{E,T})$. Across reader groups and reading regimes, surprisal and other psycholinguistic word properties outperform traditional readability methods; entropy and the "big three" word properties are often among the strongest predictors, and surprisal is typically the best overall. The findings argue for cognitively grounded readability metrics and highlight the limitations of current automatic readability assessment (ARA) approaches in capturing the online reading experience, suggesting new directions that integrate psycholinguistic insights and robust online measures.
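The evaluation score described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`pearson_corr`, `eval_method`), the direction of the difference (original minus simplified), and the toy numbers are all assumptions made for illustration. It only shows the general idea of correlating per-text score differences with per-text reading-ease differences.

```python
import numpy as np


def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return float((xc * yc).sum() / denom)


def eval_method(score_orig, score_simp, ease_orig, ease_simp):
    """Eval_M: correlation of per-text score deltas with reading-ease deltas.

    Each argument holds one value per text T. Taking the difference between
    the two versions of the same text is what controls for content.
    Assumption: deltas are computed as original minus simplified.
    """
    delta_score = np.asarray(score_orig, dtype=float) - np.asarray(score_simp, dtype=float)
    delta_ease = np.asarray(ease_orig, dtype=float) - np.asarray(ease_simp, dtype=float)
    return pearson_corr(delta_score, delta_ease)


# Hypothetical toy data for four texts (not taken from the study).
score_orig = [10, 12, 9, 14]      # readability score of each original text
score_simp = [7, 8, 8, 9]         # readability score of each simplified text
ease_orig = [250, 270, 240, 300]  # e.g. a mean reading-time measure, original
ease_simp = [235, 250, 235, 275]  # the same measure, simplified

print(eval_method(score_orig, score_simp, ease_orig, ease_simp))
```

In this toy example the reading-ease differences are an exact multiple of the score differences, so the correlation is close to 1 by construction. With real data, a low value would indicate that changes in a method's score track changes in reading ease poorly.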
Abstract
Methods for scoring text readability have been studied for over a century, and are widely used in research and in user-facing applications in many domains. Thus far, the development and evaluation of such methods have primarily relied on two types of offline behavioral data: performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability: real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods which quantifies their ability to account for reading ease, while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier Large Language Models, and commercial systems used in education suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for prediction of reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that can better account for reading ease.
