Readability Formulas, Systems and LLMs are Poor Predictors of Reading Ease

Keren Gruteke Klein, Shachar Frenkel, Omer Shubi, Yevgeni Berzak

TL;DR

This study questions the effectiveness of widely used readability formulas, NLP-based metrics, frontier LLMs, and commercial educational tools in predicting real-time reading ease. It introduces a content-controlled evaluation framework that uses eye-tracking data from a large English reading corpus (OneStopL1&L2) to measure ΔReadingEase across original and simplified texts and assesses how well ΔScore_M predicts ΔReadingEase_E via Pearson correlations, with $\text{Eval}_M = \text{Pearson}_{corr}(\Delta\text{Score}_{M,T}, \Delta\text{ReadingEase}_{E,T})$. Across reader groups and reading regimes, surprisal and other psycholinguistic measures outperform traditional readability methods, with entropy and the big three word properties often providing the strongest guidance; surprisal is typically the best overall predictor. The findings argue for cognitively grounded readability metrics and highlight the limitations of current ARA approaches for capturing online reading experiences, suggesting new directions that integrate psycholinguistic insights and robust online measures.
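
The evaluation score described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pearson_r` and `eval_method`, the sign convention (original minus simplified), and the toy numbers are all assumptions made for illustration. The sketch takes per-pair readability scores and per-pair average reading ease values, forms the two difference vectors, and correlates them.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def eval_method(score_orig, score_simp, ease_orig, ease_simp):
    """Correlate per-pair score differences with per-pair reading ease differences.

    Assumption: differences are taken as original minus simplified for both
    quantities. Flipping the sign of both leaves the correlation unchanged.
    """
    delta_score = [o - s for o, s in zip(score_orig, score_simp)]
    delta_ease = [o - s for o, s in zip(ease_orig, ease_simp)]
    return pearson_r(delta_score, delta_ease)


# Hypothetical toy values for four original/simplified text pairs.
score_orig = [10.0, 12.0, 9.0, 14.0]       # readability scores, original texts
score_simp = [7.0, 8.0, 7.0, 9.0]          # readability scores, simplified texts
ease_orig = [300.0, 320.0, 280.0, 350.0]   # e.g. mean fixation time, original
ease_simp = [270.0, 280.0, 260.0, 300.0]   # e.g. mean fixation time, simplified

r = eval_method(score_orig, score_simp, ease_orig, ease_simp)
print(round(r, 3))  # 1.0, because the toy differences are perfectly linear
```

Working with differences within each original/simplified pair is what makes the comparison content-controlled: both versions of a pair convey the same content, so the correlation reflects how well a method tracks the effect of simplification rather than topic differences between texts.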

Abstract

Methods for scoring text readability have been studied for over a century, and are widely used in research and in user-facing applications in many domains. Thus far, the development and evaluation of such methods have primarily relied on two types of offline behavioral data, performance on reading comprehension tests and ratings of text readability levels. In this work, we instead focus on a fundamental and understudied aspect of readability, real-time reading ease, captured with online reading measures using eye tracking. We introduce an evaluation framework for readability scoring methods which quantifies their ability to account for reading ease, while controlling for content variation across texts. Applying this evaluation to prominent traditional readability formulas, modern machine learning systems, frontier Large Language Models and commercial systems used in education, suggests that they are all poor predictors of reading ease in English. This outcome holds across native and non-native speakers, reading regimes, and textual units of different lengths. The evaluation further reveals that existing methods are often outperformed by word properties commonly used in psycholinguistics for prediction of reading times. Our results highlight a fundamental limitation of existing approaches to readability scoring, the utility of psycholinguistics for readability research, and the need for new, cognitively driven readability scoring approaches that can better account for reading ease.

Paper Structure

This paper contains 37 sections, 6 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Evaluation of readability scoring methods and psycholinguistic measures using their predictivity of reading ease. Years denote the publication year of each method. The evaluations use three eye movement measures of reading ease: TF (the sum of all fixation durations on fixated words), SR (the fraction of skipped words), and RR (the number of backward saccades per word). Each bar depicts the Pearson correlation coefficient $r$ of $\Delta \text{Score}_{M,T}$ with $\Delta \text{ReadingEase}_{E,T}$, where $\Delta \text{Score}_{M,T}$ is the readability score difference between pairs of original and simplified texts $T$ according to method $M$, and $\Delta \text{ReadingEase}_{E,T}$ is the average reading ease difference according to measure $E$ on the same pairs of texts across participants. Error bars are 95% confidence intervals. Colors represent the statistical significance level of the correlation.
  • Figure A2: Pairwise statistical comparisons of the reading ease predictivity of readability scoring methods and psycholinguistic measures. Each cell $(i,j)$ compares the evaluation of readability formula $M_i$ with the evaluation of readability formula $M_j$. Significance is assessed using Steiger’s (1980) test for dependent overlapping correlations (see Section \ref{sec:compare_readability}). Cell color indicates the $p$ value of the test, with "Positive" and "Negative" denoting higher or lower correlations for the row measure relative to the column measure, respectively.
  • Figure A3: Pairwise Pearson correlation between different readability scoring methods and psycholinguistic measures. Each cell $(i,j)$ is colored by the Pearson $r$ correlation between the readability score differences $\Delta\text{Score}_{M_i,T}$ and $\Delta\text{Score}_{M_j,T}$ produced by methods $M_i$ and $M_j$, measured across all the textual items $T$ in OneStopL1&L2.
  • Figure A4: Reading ease predictivity of readability scoring methods and psycholinguistic measures for L1 and L2 speakers. Each pair of bars shows the Pearson correlation $r$ for L2 readers (top, striped) and L1 readers (bottom). To the right of each pair, we display the difference in correlations (L2 $-$ L1) together with its statistical significance, assessed using a Steiger correlation test (see Section \ref{sec:compare_L1-L2}). Significant differences are colored in green if L2 > L1 and in blue if L1 > L2. Bar colors indicate the statistical significance level of the correlation.
  • Figure A5: Reading ease predictivity of readability scoring methods and psycholinguistic measures for the ordinary reading and information seeking regimes. Each pair of bars shows the Pearson correlation $r$ for readers in the information seeking regime (top, striped) and in the ordinary reading regime (bottom). To the right of each pair, we display the difference in correlations (ordinary reading $-$ information seeking) together with its statistical significance, assessed using a Steiger correlation test (see Section \ref{sec:compare_hunting}). Significant differences are colored in green if ordinary reading > information seeking and in blue if information seeking > ordinary reading. Bar colors indicate the statistical significance level of the correlation.
  • ...and 8 more figures
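
The captions above report 95% confidence intervals for Pearson correlations but do not state how they were computed. One common choice, assumed here purely for illustration, is the Fisher z-transformation. The function name `fisher_ci` and the example values are hypothetical, and the sketch treats the text pairs as independent observations.

```python
import math


def fisher_ci(r, n, z_crit=1.959963984540054):
    """Approximate confidence interval for a Pearson r via the Fisher z-transform.

    r      : observed correlation, strictly between -1 and 1
    n      : number of paired observations (must be at least 4)
    z_crit : standard normal quantile; the default corresponds to a 95% interval
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 4:
        raise ValueError("need at least 4 observations")
    z = math.atanh(r)                 # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Hypothetical example: r = 0.30 observed over 100 text pairs.
low, high = fisher_ci(0.30, 100)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

The interval is symmetric on the z scale but not on the r scale, which is why error bars for correlations far from zero tend to look lopsided.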