Question Difficulty Ranking for Multiple-Choice Reading Comprehension

Vatsal Raina; Mark Gales

TL;DR

The paper tackles automated ranking of multiple-choice (MC) reading-comprehension questions by difficulty when little in-domain labeled data is available. It compares task-transfer methods (level classification trained on RACE++ and a difficulty measure derived from a reading comprehension (RC) system) with zero-shot prompting (absolute and comparative) of instruction-tuned LLMs, evaluating on the CMCQRD dataset with RACE++ as training data. Zero-shot comparative prompting yields the strongest single signal (Spearman's correlation of 40.4%), outperforming absolute prompting and the task-transfer baselines, while combining level classification with comparative prompting improves the correlation further (≈43.7%). The approach offers a scalable, data-efficient route for ranking MC items by difficulty, with practical implications for exam design and language assessment.

Abstract

Multiple-choice (MC) tests are an efficient method to assess English learners. It is useful for test creators to rank candidate MC questions by difficulty during exam curation. Typically, the difficulty is determined by having human test takers trial the questions in a pretesting stage, but this is expensive and not scalable. Therefore, we explore automated approaches to rank MC questions by difficulty. Because there is limited data for explicitly training a system to predict difficulty scores, we compare task transfer and zero-shot approaches: task transfer adapts level classification and reading comprehension systems for difficulty ranking, while zero-shot prompting of instruction-finetuned language models contrasts absolute assessment against comparative assessment. Level classification is found to transfer better than reading comprehension. Additionally, zero-shot comparative assessment is more effective at question difficulty ranking than absolute assessment and even the task transfer approaches, achieving a Spearman's correlation of 40.4%. Combining the systems is observed to further boost the correlation.
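To make the comparative setup concrete, the following is a minimal sketch of how pairwise "which question is harder?" judgments could be turned into a ranking and scored with Spearman's correlation. It is not the authors' implementation: the win-rate aggregation, the function names, and the toy `prefer_harder` judge (a stand-in for a prompted language model) are all assumptions made for illustration, and the numbers are invented.

```python
from itertools import permutations


def win_rate_scores(questions, prefer_harder):
    """Score each question by the average probability that it is judged harder.

    `prefer_harder(a, b)` is a hypothetical judge returning P(a is harder than b);
    in practice this would wrap a comparative prompt to a language model.
    """
    wins = {q: 0.0 for q in questions}
    for a, b in permutations(questions, 2):  # both orderings of every pair
        p = prefer_harder(a, b)
        wins[a] += p
        wins[b] += 1.0 - p
    n_comparisons = 2 * (len(questions) - 1)  # comparisons each question takes part in
    return {q: w / n_comparisons for q, w in wins.items()}


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Toy data (invented): reference difficulties and a proxy the toy judge relies on.
questions = ["q1", "q2", "q3", "q4"]
true_difficulty = {"q1": 0.2, "q2": 0.5, "q3": 0.4, "q4": 0.9}
proxy = {"q1": 1, "q2": 2, "q3": 3, "q4": 4}


def toy_judge(a, b):
    # Deterministic stand-in for a model's comparative judgment.
    return 1.0 if proxy[a] > proxy[b] else 0.0


scores = win_rate_scores(questions, toy_judge)
rho = spearman([scores[q] for q in questions],
               [true_difficulty[q] for q in questions])
print(round(rho, 3))
```

Averaging over both orderings of each pair is one simple way to reduce sensitivity to the position in which a question is presented; other aggregation schemes would fit the same interface.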
