Generating Educational Materials with Different Levels of Readability using LLMs

Chieh-Yang Huang, Jing Wei, Ting-Hao 'Kenneth' Huang

TL;DR

This work defines the leveled-text generation task, rewriting educational materials to a target Lexile level while preserving meaning, and evaluates three large language models under zero- and few-shot prompting. It introduces a sizable parallel dataset and a two-pronged evaluation: readability alignment (MAE, Match, and Direction) and content preservation (BERTScore and semantic similarity). Few-shot prompting substantially improves performance: LLaMA-2 70B is best at reaching the target readability and GPT-3.5 is best at preserving meaning, though problems such as the risk of introducing misinformation and unevenly distributed edits persist. The paper highlights these challenges for safe, reliable educational content generation and calls for further research and human-in-the-loop strategies to ensure quality and instructional value.

Abstract

This study introduces the leveled-text generation task, aiming to rewrite educational materials to specific readability levels while preserving meaning. We assess the capability of GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B to generate content at various readability levels through zero-shot and few-shot prompting. Evaluating 100 processed educational materials reveals that few-shot prompting significantly improves performance in readability manipulation and information preservation. LLaMA-2 70B performs better in achieving the desired difficulty range, while GPT-3.5 better preserves the original meaning. However, manual inspection highlights concerns such as the introduction of misinformation and inconsistent edit distribution. These findings emphasize the need for further research to ensure the quality of generated educational content.

Paper Structure (18 sections, 2 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Scatter plots comparing intended and resulting Lexile scores for text generated by GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B models in zero-shot and few-shot settings. The red-shaded area represents the region where resulting scores are within $\pm$50 points of the intended scores. More data points fall above the red area than below it, indicating that resulting Lexile scores skew higher than intended: the models tend to generate slightly more complex text than the target difficulty level, regardless of the specific model or prompting approach used.
  • Figure 2: Scatter plots comparing intended and resulting Lexile shifts for text generated by GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B models in zero-shot and few-shot settings. A Lexile shift is the intended or resulting Lexile score minus the source Lexile score. Data points in the first and third quadrants indicate that text complexity changed in the correct direction. The overall distribution nevertheless skews towards higher difficulty, suggesting that the models tend to generate text slightly more complex than the intended shift, regardless of the specific model or prompting approach employed.
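
The readability-alignment metrics named in the TL;DR can be illustrated with a minimal sketch. It rests on assumptions read off the figure captions rather than on the paper's own equations: Match is taken to be the share of outputs whose resulting Lexile score lies within 50 points of the intended score (the shaded band in Figure 1), and Direction is taken to be the share of outputs whose resulting shift has the same sign as the intended shift (the first and third quadrants in Figure 2). The function and class names, the treatment of zero shifts, and the sample scores are hypothetical; the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AlignmentScores:
    mae: float        # mean absolute error between intended and resulting Lexile
    match: float      # fraction of outputs within the tolerance band (assumed +/-50)
    direction: float  # fraction of outputs shifted in the intended direction


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def readability_alignment(
    source: Sequence[float],
    intended: Sequence[float],
    resulting: Sequence[float],
    tolerance: float = 50.0,  # assumption: width of the shaded band in Figure 1
) -> AlignmentScores:
    """Compute MAE, Match, and Direction over parallel lists of Lexile scores."""
    if not source or not (len(source) == len(intended) == len(resulting)):
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(source)
    abs_errors = [abs(r - t) for t, r in zip(intended, resulting)]

    mae = sum(abs_errors) / n
    match = sum(e <= tolerance for e in abs_errors) / n
    # A shift is the intended (or resulting) score minus the source score.
    direction = sum(
        sign(t - s) == sign(r - s)
        for s, t, r in zip(source, intended, resulting)
    ) / n
    return AlignmentScores(mae=mae, match=match, direction=direction)


# Made-up scores for illustration only; they are not results from the paper.
example = readability_alignment(
    source=[1000, 1000, 800, 600],
    intended=[800, 1200, 600, 900],
    resulting=[840, 950, 700, 1000],
)
print(example)
```

In this made-up example only the first output lands inside the tolerance band, while three of the four outputs move in the intended direction, which shows why Match and Direction can diverge for the same set of generations.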