Generating Educational Materials with Different Levels of Readability using LLMs
Chieh-Yang Huang, Jing Wei, Ting-Hao 'Kenneth' Huang
TL;DR
This work defines the leveled-text generation task: rewriting educational materials at a target Lexile level while preserving meaning. It evaluates three large language models under zero- and few-shot prompting. It introduces a sizable parallel dataset and a two-pronged evaluation (readability alignment via MAE, Match, and Direction; content preservation via BERTScore and semantic similarity). Findings show that few-shot prompting substantially improves performance, with LLaMA-2 70B best at reaching target readability and GPT-3.5 superior at preserving meaning, though issues like misinformation risk and uneven edits persist. The paper highlights important challenges for safe, reliable educational content generation and calls for further research and human-in-the-loop strategies to ensure quality and instructional value.
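The summary names the three readability-alignment metrics without defining them, so the following Python sketch shows one plausible way to compute them. It assumes MAE is the mean absolute gap between the target and output Lexile scores, Match is the share of outputs within a tolerance of the target, and Direction is the share of outputs that moved from the source toward the intended side. The `Rewrite` record, the `readability_metrics` function, and the 50-point default tolerance are hypothetical names and values chosen for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Rewrite:
    """One rewritten passage (hypothetical record; all scores are Lexile values)."""
    source_level: float  # readability score of the original text
    target_level: float  # readability score requested in the prompt
    output_level: float  # readability score measured on the model output


def readability_metrics(rewrites: Sequence[Rewrite],
                        tolerance: float = 50.0) -> Dict[str, float]:
    """Compute MAE, Match, and Direction under the assumptions stated above.

    `tolerance` is an assumed threshold for counting an output as a match.
    """
    if not rewrites:
        raise ValueError("at least one rewrite is required")

    n = len(rewrites)
    abs_errors = [abs(r.output_level - r.target_level) for r in rewrites]

    def moved_as_intended(r: Rewrite) -> bool:
        intended = r.target_level - r.source_level
        actual = r.output_level - r.source_level
        if intended == 0:
            # No change requested: correct only if the level did not move.
            return actual == 0
        # Same sign means the output moved toward the requested side.
        return intended * actual > 0

    return {
        "mae": sum(abs_errors) / n,
        "match": sum(e <= tolerance for e in abs_errors) / n,
        "direction": sum(moved_as_intended(r) for r in rewrites) / n,
    }
```

Under these assumptions, an output can score well on Direction while missing on Match (it moved the right way but overshot), which is why the three numbers are reported separately.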
Abstract
This study introduces the leveled-text generation task, which aims to rewrite educational materials to specific readability levels while preserving meaning. We assess the capability of GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B to generate content at various readability levels through zero-shot and few-shot prompting. Evaluating 100 processed educational materials reveals that few-shot prompting significantly improves performance in readability manipulation and information preservation. LLaMA-2 70B performs better in achieving the desired difficulty range, while GPT-3.5 better preserves the original meaning. However, manual inspection highlights concerns such as the introduction of misinformation and an inconsistent distribution of edits. These findings emphasize the need for further research to ensure the quality of generated educational content.
