Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou

TL;DR

This paper asks whether Large Language Models can meaningfully estimate how difficult test items are for students, a task known as Item Difficulty Prediction (IDP), without task-specific data. It introduces a dual observer/actor framework, evaluates it with Spearman correlations and Rasch IRT, and applies a proficiency-simulation protocol across four domains with 20+ models. The findings reveal systematic misalignment: neither scaling nor proficiency prompts align AI difficulty estimates with human struggles, and the models converge on a shared machine consensus that often diverges from human difficulty. Models also exhibit metacognitive blindness, showing little introspection about their own limitations. The work highlights the gap between solving a problem and understanding its cognitive difficulty, suggesting that new grounding methods are needed for reliable automated difficulty prediction in education.
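To make the alignment metric concrete, here is a minimal sketch of a Spearman rank correlation between human item difficulties and a model's difficulty estimates. It is an illustration only, not the paper's evaluation code: the function names `average_ranks` and `spearman` and the five-item toy data are assumptions introduced for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one value per item, higher means harder.
human_difficulty = [0.2, 0.5, 0.9, 0.4, 0.7]
model_estimate = [0.3, 0.3, 0.4, 0.35, 0.2]

rho = spearman(human_difficulty, model_estimate)
print(f"Spearman rho between human and model difficulty: {rho:.3f}")
```

Because the measure depends only on rank order, a model whose estimates sit on a different scale from the human difficulties can still score well, provided it orders the items the same way.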

Abstract

Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

Paper Structure

This paper contains 31 sections, 5 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: The violin plot of the difficulty prediction distributions of several representative models. Current advanced models exhibit severe distribution shift.
  • Figure 2: The Spearman correlation trends when greedily ensembling the predictions of the top-$K$ models. The curve indicates the upper bound of the ensemble performance, which is still weak.
  • Figure 3: Heatmap showing the correlation change when applying specific personas compared to the baseline. The impact of individual personas is highly inconsistent and noisy.
  • Figure 4: The consensus heatmap of the Spearman correlation between the models on the USMLE dataset.
  • Figure 5: The consensus heatmap of the Spearman correlation between the models on the CMCQRD dataset.
  • ...and 9 more figures