Measuring and Modifying the Readability of English Texts with GPT-4

Sean Trott, Pamela D. Rivière

TL;DR

We find evidence that GPT-4 Turbo can reliably make text easier or harder to read. We also discuss the limitations of this approach, including its limited scope and the validity of the ``readability'' construct.

Abstract

The success of Large Language Models (LLMs) in other domains has raised the question of whether LLMs can reliably assess and manipulate the readability of text. We approach this question empirically. First, using a published corpus of 4,724 English text excerpts, we find that readability estimates produced ``zero-shot'' by GPT-4 Turbo and GPT-4o mini correlate relatively highly with human judgments (r = 0.76 and r = 0.74, respectively), outperforming estimates derived from traditional readability formulas and various psycholinguistic indices. Then, in a pre-registered human experiment (N = 59), we ask whether GPT-4 Turbo can reliably make text easier or harder to read. We find evidence to support this hypothesis, though considerable variance in human judgments remains unexplained. We conclude by discussing the limitations of this approach, including its limited scope, as well as the validity of the ``readability'' construct and its dependence on context, audience, and goal.
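To make the reported correlations concrete, the following is a minimal sketch of how a Pearson correlation between model-produced ratings and human judgments could be computed, and how r relates to the share of variance explained. The function name `pearson_r` and the toy ratings are hypothetical and are not taken from the paper's data or code. Under the assumption of a simple linear fit with one predictor, r = 0.76 corresponds to $R^2 \approx 0.58$.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy ratings, for illustration only (not the paper's data).
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
model_ratings = [1.5, 1.9, 3.2, 3.8, 5.1]

r = pearson_r(model_ratings, human_ratings)
print(f"toy r = {r:.3f}, toy r^2 = {r ** 2:.3f}")

# With a single predictor, squaring r gives the proportion of variance explained.
for reported_r in (0.76, 0.74):
    print(f"r = {reported_r} -> r^2 = {reported_r ** 2:.2f}")
```

Squaring the reported coefficients this way also shows why considerable variance in human judgments remains unexplained even at r = 0.76.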

Paper Structure

This paper contains 20 sections and 6 figures.

Figures (6)

  • Figure 1: Relationship between ratings elicited by GPT-4 Turbo and average human readability judgments ($R^2 = 0.58$).
  • Figure 2: Feature importance scores for each predictor, as determined using a random forest regression. A higher value indicates that this feature was more useful for predicting human readability judgments.
  • Figure 3: Distribution of human readability judgments for each text condition.
  • Figure 4: Correlation matrix between all the variables considered in Study 1. Correlation coefficients have all been transformed to absolute values for easier comparison.
  • Figure 5: Comparison of Flesch readability for the original version and modified version, according to Turbo's instructions.
  • ...and 1 more figure
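Figure 5 refers to Flesch readability. As a rough illustration, the sketch below computes the standard Flesch Reading Ease score, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word), where higher scores indicate easier text. It assumes that "Flesch readability" means Flesch Reading Ease. The function names, the regex-based sentence and word splitting, and the vowel-group syllable heuristic are all assumptions made for illustration; the paper's own tooling may differ.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    # Naive splitting: sentences end at runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example texts, not drawn from the paper's corpus.
easy_text = "The cat sat. The dog ran."
hard_text = "Considerable heterogeneity characterizes psycholinguistic indices."

print(f"easy: {flesch_reading_ease(easy_text):.2f}")
print(f"hard: {flesch_reading_ease(hard_text):.2f}")
```

Because the formula depends only on sentence length and syllable counts, it ignores word familiarity and discourse structure. That limitation is one reason surface formulas of this kind can lag behind human judgments.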