Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus

Thiago Brant; Julien Kühn; Jun Pang

TL;DR

This study benchmarks ten LLMs on Brazil's ENEM, comparing their item-difficulty estimates to official IRT parameters for 1,031 questions across four subjects. Models recover useful rank signals, but they show systematic biases in absolute difficulty and are sensitive to prompt wording and input modality, especially for visual content. The authors propose an evaluation-before-generation pipeline with light post-hoc calibration and fairness diagnostics, showing that LLMs are better suited as screening tools than as authoritative difficulty estimators. The work highlights the need for calibration, modality-preserving input handling, and guardrails around context cues to prevent biased or inequitable adaptive assessment. It lays a concrete foundation for trustworthy integration of LLMs into assessment workflows, emphasizing transparent evaluation and continuous auditing before any exam content is generated.

Abstract

As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil's high-stakes ENEM as a testbed, we benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. We evaluate performance along three axes: absolute calibration, rank fidelity, and context sensitivity across learner backgrounds. Our results reveal a trade-off: the best models achieve moderate rank correlation, yet they systematically underestimate difficulty and degrade markedly on multimodal items. Crucially, models exhibit limited and inconsistent plasticity when prompted with student demographic cues, suggesting they are not yet ready for context-adaptive personalization. We conclude that LLMs function best as calibrated screeners rather than authoritative oracles, supporting an "evaluation-before-generation" pipeline for responsible assessment design.
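The first two evaluation axes can be illustrated with a small sketch. It assumes that absolute calibration is measured by RMSE and rank fidelity by Spearman's rho (the two metrics named in the figure captions), and it uses an ordinary least-squares line as one possible form of light post-hoc calibration. The function names and the toy difficulty values are hypothetical, not taken from the paper, and the paper's exact definitions may differ.

```python
from math import sqrt


def rmse(pred, true):
    """Root-mean-square error between predicted and reference difficulties."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(pred, true):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(pred), average_ranks(true))


def fit_linear_calibration(pred, true):
    """Least-squares slope and intercept mapping predictions to references."""
    mx, my = sum(pred) / len(pred), sum(true) / len(true)
    cov = sum((p - mx) * (t - my) for p, t in zip(pred, true))
    vx = sum((p - mx) ** 2 for p in pred)
    slope = cov / vx
    return slope, my - slope * mx


# Toy data (hypothetical): the model orders items correctly but
# underestimates every difficulty by the same amount.
irt_difficulty = [-1.0, 0.0, 1.0, 2.0]
llm_estimate = [-1.5, -0.5, 0.5, 1.5]

raw_rmse = rmse(llm_estimate, irt_difficulty)
rho = spearman(llm_estimate, irt_difficulty)
slope, intercept = fit_linear_calibration(llm_estimate, irt_difficulty)
calibrated = [slope * p + intercept for p in llm_estimate]
calibrated_rmse = rmse(calibrated, irt_difficulty)
```

In this toy case the rank correlation is perfect while the raw RMSE is not, which is the kind of gap between rank fidelity and absolute calibration the abstract describes. In practice the calibration line would be fit on held-out items rather than on the items being scored.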

Paper Structure

This paper contains 53 sections, 11 equations, 19 figures, and 18 tables.

Introduction
Related Work
LLMs in Educational Assessment
Implications.
Psychometric Models and Difficulty Estimation
Implications.
Prompt Engineering and LLM Evaluation
Implications.
Geographic, Cultural, and Nationality Bias in LLMs
Implications.
Dataset and Data Processing
Methods
Prompt Families
Model Roster
Prediction Pipeline
...and 38 more sections

Figures (19)

Figure 1: Performance across all model–prompt pairs.
Figure 2: Trade-off between RMSE and Spearman $\rho$ for each model's best prompt.
Figure 3: Mathematics: RMSE heatmap by prompt and model.
Figure 4: Mathematics: Spearman $\rho$ heatmap by prompt and model.
Figure 5: Languages: RMSE heatmap by prompt and model.
...and 14 more figures
