Reasoning and Sampling-Augmented MCQ Difficulty Prediction via LLMs

Wanyong Feng, Peter Tran, Stephen Sireci, Andrew Lan

TL;DR

This work addresses the challenge of predicting MCQ difficulty by jointly modeling the cognitive steps needed to reach the correct option and the plausibility of the distractors. It proposes a two-stage framework that (i) augments each option with LLM-generated reasoning or feedback, (ii) samples diverse student knowledge profiles from an IRT-inspired distribution to predict option-level selection likelihoods, and (iii) uses these likelihoods to predict MCQ difficulty, with a KL-divergence regularizer that aligns the predicted likelihoods with their ground-truth values. The method comprises four modules (Reasoning/Feedback Generation, Feature Extraction, Student Interaction, and Difficulty Prediction) that are trained end-to-end. It is validated on two real-world math MCQ datasets, achieving up to a 28.3% reduction in mean squared error (MSE) and a 34.6% improvement in the coefficient of determination ($R^2$) over strong baselines. Qualitative analysis and visualizations suggest that reasoning augmentation and multi-profile sampling yield more accurate difficulty estimates and better preserve relative difficulty rankings. The approach offers practical benefits for adaptive testing and automatic question generation, with potential to generalize to domains beyond mathematics.
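To make the two reported metrics concrete, here is a minimal sketch of how MSE, $R^2$, and a percentage change against a baseline could be computed. The toy difficulty values and the helper names (`mse`, `r2`, `relative_change`) are illustrative assumptions, not taken from the paper, and the sketch assumes the quoted percentages are relative changes, which the summary does not state explicitly.

```python
def mse(y_true, y_pred):
    """Mean squared error between true and predicted difficulties."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_change(baseline, ours):
    """Signed change of `ours` relative to `baseline` (assumed convention)."""
    return (ours - baseline) / abs(baseline)


# Toy IRT-style difficulty values (made up for illustration only).
true_difficulty = [-1.0, 0.0, 1.0, 2.0]
baseline_pred = [0.0, 0.0, 0.0, 1.0]
ours_pred = [-0.5, 0.0, 0.5, 2.0]

mse_reduction = -relative_change(mse(true_difficulty, baseline_pred),
                                 mse(true_difficulty, ours_pred))
r2_improvement = relative_change(r2(true_difficulty, baseline_pred),
                                 r2(true_difficulty, ours_pred))
print(f"MSE reduction: {mse_reduction:.1%}, R^2 improvement: {r2_improvement:.1%}")
```

Lower MSE and higher $R^2$ both indicate better predictions, which is why one figure is reported as a reduction and the other as an improvement.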

Abstract

The difficulty of multiple-choice questions (MCQs) is a crucial factor for educational assessments. Predicting MCQ difficulty is challenging since it requires understanding both the complexity of reaching the correct option and the plausibility of distractors, i.e., incorrect options. In this paper, we propose a novel, two-stage method to predict the difficulty of MCQs. First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option. We use not just the MCQ itself but also these reasoning steps as input to predict the difficulty. Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ. This setup, inspired by item response theory (IRT), enables us to estimate the likelihood of students selecting each option, both correct and incorrect. We align these predictions with their ground truth values using a Kullback-Leibler (KL) divergence-based regularization objective, and use the estimated likelihoods to predict MCQ difficulty. We evaluate our method on two real-world math MCQ and response datasets with ground truth difficulty values estimated using IRT. Experimental results show that our method outperforms all baselines, with up to a 28.3% reduction in mean squared error and a 34.6% improvement in the coefficient of determination. We also qualitatively discuss how our novel method results in higher accuracy in predicting MCQ difficulty.
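The sampling and KL-regularization ideas described above can be illustrated with a small sketch. This is not the authors' model: the softmax over `slope * theta + intercept` (in the spirit of a nominal response model), the standard normal knowledge distribution, the hand-set option parameters, and the weight `lam` are all assumptions chosen for illustration. In the paper, option-level quantities are produced by learned modules from the question and its LLM-generated reasoning.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def option_likelihoods(slopes, intercepts, n_students=1000, seed=0):
    """Average option-selection probabilities over sampled knowledge levels."""
    rng = random.Random(seed)
    totals = [0.0] * len(slopes)
    for _ in range(n_students):
        theta = rng.gauss(0.0, 1.0)  # assumption: knowledge ~ N(0, 1)
        probs = softmax([a * theta + b for a, b in zip(slopes, intercepts)])
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / n_students for t in totals]


def kl_divergence(p_true, q_pred, eps=1e-12):
    """KL(p_true || q_pred) between two option-selection distributions."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_true, q_pred) if p > 0)


def total_loss(d_true, d_pred, p_true, q_pred, lam=0.1):
    """Squared difficulty error plus a KL regularizer (lam is hypothetical)."""
    return (d_true - d_pred) ** 2 + lam * kl_divergence(p_true, q_pred)


# Hypothetical 4-option MCQ; option 0 is the correct answer.
predicted = option_likelihoods(slopes=[1.0, -0.5, -0.2, -0.3],
                               intercepts=[0.5, 0.0, -0.5, -1.0])
observed = [0.5, 0.25, 0.15, 0.10]  # made-up ground-truth selection rates
loss = total_loss(d_true=0.3, d_pred=0.1, p_true=observed, q_pred=predicted)
print(predicted, loss)
```

Averaging over many sampled knowledge levels, rather than using a single "average student", lets a distractor that mostly attracts low-knowledge students still contribute to its predicted selection rate.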

Paper Structure

This paper contains 15 sections, 9 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of our MCQ difficulty prediction pipeline.
  • Figure 2: Sampled student knowledge (left) and question difficulty (right).