Do LLMs Implicitly Determine the Suitable Text Difficulty for Users?

Seiji Gobara; Hidetaka Kamigaito; Taro Watanabe

TL;DR

The paper investigates whether LLMs can implicitly determine the appropriate text difficulty for users by aligning generated explanations to user input without explicit comprehension prompts. It evaluates a broad set of open-source and proprietary models on two datasets, Stack-Overflow QA and TSCC, using metrics such as text-difficulty correlations, BERTScore for synonymity, and redundancy measures, with prompts designed to avoid explicit user-level cues. The findings indicate that instruction-tuning generally contributes more than model size to the alignment between input and output difficulty, with GPT-3.5/4 achieving high performance and some open models matching or surpassing human baselines in specific settings. The work highlights the potential for zero-shot, instruction-tuned models to support personalized education, while also noting limitations related to domain coverage and evaluation methodology and outlining future directions for broader-domain data and cross-language assessment.

Abstract

Education suited to each student's learning level is necessary to improve understanding. The first step toward achieving this with large language models (LLMs) is to adjust the textual difficulty of responses to students. This work analyzes how LLMs can implicitly adjust the difficulty of their generated text to match the user input. To conduct the experiments, we created a new dataset from Stack-Overflow to explore the performance of question-answering-based conversation. Experimental results on the Stack-Overflow dataset and the TSCC dataset, which includes multi-turn conversation, show that LLMs can implicitly handle text difficulty between user input and the generated response. We also observed that some LLMs can surpass humans in handling text difficulty, and that instruction-tuning is important.
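The idea of correlating input and output text difficulty can be illustrated with a small sketch. It is not the authors' implementation: this page does not name the difficulty metric or the correlation coefficient, so the Flesch-Kincaid Grade Level, the Spearman rank correlation, the rough vowel-group syllable counter, and all function names below are assumptions chosen for illustration.

```python
import re
from statistics import mean


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level, used here as an assumed difficulty score."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input
    return cov / (var_x * var_y) ** 0.5


def difficulty_alignment(pairs) -> float:
    """Correlate the difficulty of user inputs with that of the responses.

    `pairs` is a list of (user_input, generated_response) strings.
    """
    inputs = [fkgl(question) for question, _ in pairs]
    outputs = [fkgl(answer) for _, answer in pairs]
    return spearman(inputs, outputs)
```

Under these assumptions, a value of `difficulty_alignment` close to 1 would mean that harder questions tend to receive harder responses, which is the kind of implicit adjustment the abstract describes.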

Paper Structure (41 sections, 4 figures, 15 tables)

Introduction
Experimental Setup
  Dataset
    Stack-Overflow
    TSCC
  Models
  Prompts
  Metrics
    Text Difficulty
    Synonymity
    Redundancy
Results and Discussion
  Stack-Overflow
  TSCC
Conclusion
...and 26 more sections

Figures (4)

Figure 1: Overview of our evaluation procedure. We evaluate generated texts from LLMs for user questions by comparing the correlation of text difficulty and redundancy. We also evaluate the synonymity between generated texts by LLMs and human answers.
Figure 2: Results on the Stack-Overflow dataset. The detailed values are given in the Appendix tables (labels tab:stack_overflow_normal and tab:stack_overflow_normal_mean).
Figure 3: Results on the TSCC dataset. The detailed values are given in the Appendix tables (labels tab:tscc and tab:tscc_mean).
Figure 4: Histograms of input tokens (Stack-Overflow)
