Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus
Thiago Brant, Julien Kühn, Jun Pang
TL;DR
This study benchmarks ten LLMs on Brazil's ENEM exam, comparing their item-difficulty estimates against official IRT parameters for 1,031 questions across four subjects. The models recover useful rank signals, but they show systematic bias in absolute difficulty and are sensitive to prompt wording and input modality, especially for visual content. The authors propose an evaluation-before-generation pipeline with light post-hoc calibration and fairness diagnostics, and show that LLMs are better suited as screening tools than as authoritative difficulty estimators. The work highlights the need for calibration, modality-preserving input handling, and guardrails around context cues to prevent biased or inequitable adaptive assessment. It lays a concrete foundation for trustworthy integration of LLMs into assessment workflows, emphasizing transparent evaluation and continuous auditing before any exam content is generated.
Abstract
As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil's high-stakes ENEM exam as a testbed, we benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. We evaluate performance along three axes: absolute calibration, rank fidelity, and context sensitivity across learner backgrounds. Our results reveal a clear trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade markedly on multimodal items. Crucially, models exhibit limited and inconsistent plasticity when prompted with student demographic cues, suggesting they are not yet ready for context-adaptive personalization. We conclude that LLMs function best as calibrated screeners rather than authoritative oracles, supporting an "evaluation-before-generation" pipeline for responsible assessment design.
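To make two of the evaluation axes concrete, the following is a minimal sketch of how rank fidelity and absolute calibration could be measured, together with a simple post-hoc correction. This section does not specify the exact metrics or calibration method, so everything here is an assumption: Spearman correlation stands in for rank fidelity, mean signed error for calibration bias, and a least-squares linear map for the "light" calibration. The five difficulty values are made-up toy numbers, not ENEM data.

```python
from statistics import mean


def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Rank fidelity (assumed metric): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_bias(predicted, reference):
    """Absolute calibration (assumed metric): mean signed error.

    Negative values mean difficulty is underestimated on average.
    """
    return mean(p - r for p, r in zip(predicted, reference))


def fit_linear_calibration(predicted, reference):
    """Least-squares slope and intercept mapping predictions onto the reference scale."""
    mp, mr = mean(predicted), mean(reference)
    slope = sum((p - mp) * (r - mr) for p, r in zip(predicted, reference)) / sum(
        (p - mp) ** 2 for p in predicted
    )
    return slope, mr - slope * mp


# Hypothetical toy values: IRT difficulty parameters and LLM estimates for five items.
irt_difficulty = [-1.0, 0.0, 0.5, 1.5, 2.0]
llm_estimate = [-1.5, -0.1, -0.8, 0.2, 1.4]

rho = spearman(llm_estimate, irt_difficulty)
bias = mean_bias(llm_estimate, irt_difficulty)

slope, intercept = fit_linear_calibration(llm_estimate, irt_difficulty)
calibrated = [slope * p + intercept for p in llm_estimate]
bias_after = mean_bias(calibrated, irt_difficulty)

print(f"Spearman rho: {rho:.2f}")
print(f"Mean bias before calibration: {bias:.2f}")
print(f"Mean bias after calibration: {bias_after:.2f}")
```

In this toy example the ranking is largely preserved while the estimates sit below the reference values, which mirrors the qualitative pattern described in the abstract. The linear map removes the average bias in-sample but leaves the rank correlation unchanged; a real evaluation would fit the calibration on one split and report bias on held-out items.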
