
Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?

Leonidas Zotos, Hedderik van Rijn, Malvina Nissim

TL;DR

The paper investigates whether large language model (LLM) uncertainty can serve as a proxy for MCQ item difficulty. It analyzes a fine-grained Biopsychology dataset of 451 MCQs using three uncertainty signals—1st Token Probability, Choice Order Sensitivity, and Choice Entropy—across multiple decoder-only LLMs, evaluating correlations with student response distributions at both question and choice levels. Overall, correlations are weak but detectable and vary by question type, model, and whether answers are correct, with stronger signals for certain formats like 'fill the gap' and 'fill two gaps'. The findings suggest that model uncertainty can complement traditional difficulty estimation, motivating future work on prompting strategies, instruction perturbations, and broader generalisability to different domains and datasets.

Abstract

Estimating the difficulty of multiple-choice questions would be of great help to educators, who must spend substantial time creating and piloting stimuli for their tests, and to learners who want to practice. Supervised approaches to difficulty estimation have so far yielded mixed results. In this contribution we leverage an aspect of generative large models that might be seen as a weakness when answering questions, namely their uncertainty, and use it to explore correlations between two different metrics of uncertainty and the actual student response distribution. While we observe weak but present correlations, we also find that the models' behaviour differs between correct and wrong answers, and that correlations differ substantially across the question types included in our fine-grained, previously unused dataset of 451 questions from a Biopsychology course. In discussing our findings, we also suggest potential avenues to further leverage model uncertainty as an additional proxy for item difficulty.
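
To make the comparison between model uncertainty and the student response distribution more concrete, the following minimal sketch computes a Shannon entropy over the answer choices of a single question, once from model probabilities and once from student selection counts. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of base-2 logarithms, and all numbers are hypothetical, and the paper's exact definition of Choice Entropy may differ.

```python
import math


def normalise(weights):
    """Turn non-negative weights per choice into a probability distribution."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {choice: w / total for choice, w in weights.items()}


def choice_entropy(probs):
    """Shannon entropy in bits over the choices (assumed definition).

    0 bits means all mass is on one choice; log2(k) bits means the
    k choices are equally likely.
    """
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical values for one four-choice question (not from the paper).
model_first_token_prob = {"A": 0.70, "B": 0.15, "C": 0.10, "D": 0.05}
student_selection_counts = {"A": 120, "B": 40, "C": 30, "D": 10}

model_entropy = choice_entropy(normalise(model_first_token_prob))
student_entropy = choice_entropy(normalise(student_selection_counts))
print(f"model entropy:   {model_entropy:.3f} bits")
print(f"student entropy: {student_entropy:.3f} bits")
```

Under this reading, a question on which both the model and the students spread their choices widely would show high entropy on both sides, which is the kind of relationship a per-question correlation would pick up.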

Paper Structure (19 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Exploring the potential of model uncertainty as a proxy for student selection rate in Multiple-Choice Questions. Model uncertainty is operationalised using 1st Token Probability and Choice Order Sensitivity.
  • Figure 2: Average selection rate per choice. Distractors are ordered by their selection rate.
  • Figure 3: Spearman Correlation (computed with the SciPy package; Virtanen et al., 2020) between student and model choice entropy. Asterisks signify significant correlation, using a significance level of $\alpha = 0.05$.
  • Figure 4: Average Chi-Squared value between the student proportions and the model uncertainty distributions. All Chi-Squared values are statistically significant at a significance level of $\alpha = 0.05$, indicating that the student and model metrics originate from significantly different distributions.
  • Figure 5: Spearman Correlation between model uncertainty metrics and student selection rates, per choice, using the complete dataset. Asterisks signify that the correlation is significant, using a significance level of $\alpha = 0.05$.
  • ...and 2 more figures
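
Several of the figures above report Spearman correlations between a model uncertainty metric and student selection rates. As a rough illustration of what that coefficient measures, the sketch below computes it from scratch as the Pearson correlation of average ranks. This is a swapped-in, standard-library version: the captions state that the paper used SciPy, whose `scipy.stats.spearmanr` also returns the p-value needed for the significance asterisks, which this sketch does not compute. The helper names and the example numbers are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-choice values for one question (not from the paper).
student_selection_rate = [0.60, 0.20, 0.15, 0.05]
model_first_token_prob = [0.70, 0.10, 0.15, 0.05]

rho = spearman(student_selection_rate, model_first_token_prob)
print(f"Spearman rho = {rho:.2f}")
```

Because the coefficient depends only on ranks, it captures whether the model and the students order the choices similarly, regardless of how far apart the raw probabilities and selection rates are.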