Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee

TL;DR

This work investigates whether LLMs can estimate the cognitive complexity of RC items along two cognitively grounded dimensions, Evidence Scope (ES) and Transformation Level (TL). It introduces ReCo, a benchmark of 776 TFNG RC items annotated by experts on these dimensions, enabling rigorous evaluation. Eight instruction-tuned LLMs are tested across prompting strategies and decoding modes. The results show that LLMs can approximate cognitive complexity for ES and 3-level TL, with open-source models performing competitively, though gaps remain in metacognitive awareness and full feature identification. The findings suggest that LLM-assisted prior difficulty analysis is feasible and scalable, while the remaining limitations motivate further research into broader item types and improved prompting for cognitive insight.

Abstract

Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before the items are administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

Paper Structure

This paper contains 24 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Examples of RC items that require determining the factuality of a statement. Each item is annotated along two cognitively grounded dimensions (Evidence Scope and Transformation Level) with corresponding supporting sentences highlighted from the passage.
  • Figure 2: Distribution of the number of evidence sentences selected by LLMs and humans.
  • Figure 3: Distribution of TL labels predicted by LLMs for single-sentence evidence items.
  • Figure 4: Representative error cases with GPT-4o’s responses.
  • Figure 5: Inter-annotator agreement across labeling dimensions. Agreement ratios are shown for (top) combined labels of two dimensions. "At Least Two" indicates majority agreement among annotators, while "All" requires unanimous agreement. Pairwise agreements between annotators (A1, A2, A3) are also reported. For TL agreement, multi-evidence items labeled as word matching or paraphrasing were considered equivalent to transformed word matching and transformed paraphrasing, respectively.
  • ...and 3 more figures
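
The agreement measures named in the Figure 5 caption ("All", "At Least Two", and pairwise agreement between annotators) can be illustrated with a minimal sketch. This is not the authors' code: the function name `agreement_ratios`, the placeholder labels, and the toy data are assumptions made for illustration, and the TL equivalence rule for multi-evidence items described in the caption is not applied here.

```python
from collections import Counter
from itertools import combinations


def agreement_ratios(labels):
    """Compute agreement ratios over items labeled by several annotators.

    `labels` is a list of tuples, one tuple per item, holding one label
    per annotator (hypothetical input format, assumed for this sketch).
    """
    n_items = len(labels)
    n_annotators = len(labels[0])

    # "All": every annotator assigned the same label to the item.
    all_agree = sum(len(set(item)) == 1 for item in labels) / n_items

    # "At Least Two": the most frequent label was chosen by two or more annotators.
    at_least_two = sum(
        Counter(item).most_common(1)[0][1] >= 2 for item in labels
    ) / n_items

    # Pairwise: share of items on which annotators i and j gave the same label.
    pairwise = {
        (i, j): sum(item[i] == item[j] for item in labels) / n_items
        for i, j in combinations(range(n_annotators), 2)
    }
    return {"all": all_agree, "at_least_two": at_least_two, "pairwise": pairwise}


# Toy data: four items, three annotators (A1, A2, A3), placeholder labels.
toy_labels = [
    ("a", "a", "a"),
    ("a", "b", "a"),
    ("b", "a", "c"),
    ("b", "b", "b"),
]
result = agreement_ratios(toy_labels)
print(result["all"])           # 0.5
print(result["at_least_two"])  # 0.75
print(result["pairwise"])      # {(0, 1): 0.5, (0, 2): 0.75, (1, 2): 0.5}
```

With three annotators, "At Least Two" is simply majority agreement, so it is always at least as high as "All".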