Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty

Yoshee Jain, John Hollander, Amber He, Sunny Tang, Liang Zhang, John Sabatini

TL;DR

This study examines whether large language models (GPT-4o and o1) can scale reading-comprehension question difficulty estimation by comparing LLM-derived difficulty parameters to traditional 2PL-IRT estimates on the SARA dataset. By prompting LLMs with full items, answers, and aggregated performance under fixed conditions ($T=1$), the authors show that LLMs produce difficulty estimates that meaningfully align with empirical IRT parameters $a$ and $b$, while exhibiting different sensitivity to extreme item characteristics. The results indicate LLMs can serve as scalable, adaptive-education tools to complement psychometrics, enabling dynamic interactions with Adaptive Instructional Systems (AIS) for personalized reading assessments. The work also highlights limitations in replicating nuanced human reasoning and suggests avenues for hybrid models, multi-agent prompting, and external computation to enhance reproducibility and effectiveness in real-world educational settings.
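For readers unfamiliar with the 2PL-IRT parameters mentioned above, the following is a minimal sketch of the standard two-parameter logistic item response function, in which $a$ is the item's discrimination and $b$ is its difficulty. The function name `p_correct_2pl` and the example values are illustrative assumptions; the summary does not state the exact parameterization or fitting procedure the authors used.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the standard 2PL model.

    theta: learner ability
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (the ability at which P(correct) = 0.5)

    Hypothetical helper for illustration only; not the authors' code.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Illustrative values (assumed, not taken from the SARA dataset).
easy_item = p_correct_2pl(theta=0.0, a=1.0, b=-1.0)
hard_item = p_correct_2pl(theta=0.0, a=1.0, b=1.0)
```

With the same ability and discrimination, the item with the larger $b$ yields the lower probability of a correct response, which is the sense in which $b$ measures difficulty.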

Abstract

Reading comprehension is key to individual success, yet assessing question difficulty remains challenging because traditional methods such as linguistic analysis and Item Response Theory (IRT) require extensive human annotation and large-scale testing. While these robust approaches provide valuable insights, their scalability is limited. Large Language Models (LLMs) could potentially automate question difficulty estimation; however, this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI's GPT-4o and o1, in estimating the difficulty of reading comprehension questions using the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the accuracy of the models in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as a scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS). Such a method could bridge the gap between traditional psychometric techniques and modern AIS for reading comprehension and pave the way for more adaptive and personalized educational assessments.

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Average Accuracy of LLMs for the Different Subtests in the SARA Dataset. Note: WRDC = Word Recognition and Decoding; VOC = Vocabulary; MA = Morphology; SEN = Sentence Processing; EFFIC = Efficiency in Basic Reading.
  • Figure 2: Average Accuracy Comparison for Different Subtests in the SARA Dataset.