Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts

Donya Rooein, Paul Rottger, Anastassia Shaitarova, Dirk Hovy

TL;DR

This work addresses the challenge of reliably assessing whether LLM-generated educational content matches a given student education level, noting that static readability metrics are crude. It introduces Prompt-based metrics derived from a targeted user study and evaluates them, both individually and in combination with Static metrics, on the ScienceQA dataset. Regression analyses show that combining Prompt-based and Static metrics yields the best text-difficulty classification performance, with interpretable feature-importance insights highlighting readability, topic relevance, and content signaling as key factors. The approach offers a practical route to improve LLM-assisted education by enabling more accurate adaptation across elementary, middle, and high school content, while motivating broader data collection with educators for domain-specific metric refinement.

Abstract

Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we create Prompt-based metrics as inputs for LLMs. They leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.

Paper Structure (36 sections, 3 figures, 10 tables)

Figures (3)

  • Figure 1: Schematic overview of our approach to text difficulty classification. We calculate relevant Static and Prompt-based metrics for a given input text. Either or both sets of metrics are then fed into a regression classifier that makes a final classification.
  • Figure 2: An illustrative example of the Prompt-based metric process. The green box contains the educational text from the ScienceQA dataset. The blue box shows the predicted educational level and the explanation. The red box contains the Prompt-based metrics based on the sample.
  • Figure 3: High-level view of the derivation process for the Prompt-based metrics using n-gram frequencies. Function words are excluded.
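
The feature-building step in Figure 1 can be pictured with a minimal Python sketch. This is not the authors' implementation. It assumes the standard Flesch Reading Ease formula as the only Static metric, a rough vowel-group heuristic for syllable counting, and hypothetical Prompt-based scores (`topic_relevance`, `readability_rating`) that would in practice come from LLM responses. The function names and feature prefixes are illustrative assumptions.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def build_features(text, prompt_scores, use_static=True, use_prompt=True):
    """Combine Static and/or Prompt-based metrics into one feature dict.

    `prompt_scores` is a hypothetical mapping of metric name to numeric score;
    in a real pipeline these values would be parsed from LLM outputs.
    """
    features = {}
    if use_static:
        features["static_flesch_reading_ease"] = flesch_reading_ease(text)
    if use_prompt:
        for name, value in prompt_scores.items():
            features[f"prompt_{name}"] = float(value)
    return features


if __name__ == "__main__":
    sample = "Plants make food from sunlight. This process is called photosynthesis."
    # Placeholder scores, invented for illustration only.
    scores = {"topic_relevance": 1, "readability_rating": 0.5}
    print(build_features(sample, scores))
```

The resulting feature dict could be passed to any regression classifier. Toggling `use_static` and `use_prompt` mirrors the "either or both" comparison described in the Figure 1 caption.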