Do LLMs Implicitly Determine the Suitable Text Difficulty for Users?
Seiji Gobara, Hidetaka Kamigaito, Taro Watanabe
TL;DR
The paper investigates whether LLMs implicitly match the difficulty of their generated explanations to the difficulty of the user's input, without any explicit prompt about the user's comprehension level. It evaluates a broad set of open-source and proprietary models on two datasets, Stack-Overflow QA and TSCC, using text-difficulty correlations, BERTScore for synonymity, and redundancy measures, with prompts designed to avoid explicit user-level cues. The findings indicate that instruction-tuning generally contributes more than model size to aligning input and output difficulty, with GPT-3.5/4 achieving high performance and some open models matching or surpassing human baselines in specific settings. The work highlights the potential of zero-shot, instruction-tuned models to support personalized education. It also notes limitations in domain coverage and evaluation methodology, and outlines future directions in broader-domain data and cross-language assessment.
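To make the idea of a text-difficulty correlation concrete, here is a minimal sketch in Python. It is not the authors' evaluation code: the choice of Flesch-Kincaid Grade Level as the difficulty score, the vowel-group syllable heuristic, the use of Pearson correlation, and all function names (`fkgl`, `pearson`, `difficulty_correlation`) are assumptions made for illustration only.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level using the heuristic syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def difficulty_correlation(pairs) -> float:
    """Correlate the difficulty of each user input with that of its response.

    `pairs` is a list of (user_input, model_response) strings.
    """
    input_scores = [fkgl(question) for question, _ in pairs]
    output_scores = [fkgl(answer) for _, answer in pairs]
    return pearson(input_scores, output_scores)
```

Under these assumptions, a correlation near 1 would suggest that the responses track the difficulty of the inputs, while a value near 0 would suggest the response difficulty is unrelated to the input. A real evaluation would likely use an established readability library and more than one difficulty measure.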
Abstract
Education that suits the individual learning level is necessary to improve students' understanding. The first step toward achieving this with large language models (LLMs) is to adjust the textual difficulty of their responses to students. This work analyzes how LLMs implicitly adjust the difficulty of their generated text to match that of the user input. To conduct the experiments, we created a new dataset from Stack-Overflow to evaluate performance on question-answering-based conversation. Experimental results on the Stack-Overflow dataset and the TSCC dataset, which includes multi-turn conversations, show that LLMs can implicitly adapt the difficulty of their responses to the user input. We also observed that some LLMs can surpass humans in handling text difficulty, and that instruction-tuning is important for this ability.
