Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models
Wanyong Feng, Jaewook Lee, Hunter McNichols, Alexander Scarlatos, Digory Smith, Simon Woodhead, Nancy Otero Ornelas, Andrew Lan
TL;DR
The paper tackles automated distractor generation for math MCQs using large language models, comparing in-context learning with k-nearest-neighbor (kNN) exemplars, chain-of-thought (CoT) prompting, fine-tuning, rule-based baselines, and sampling-based methods on a real-world dataset. It finds that in-context learning with kNN exemplars yields the strongest alignment with human-authored distractors, while CoT prompting and fine-tuning improve over the baselines but still lag behind kNN. Human evaluation shows that LLM-generated distractors are mathematically valid but less effective at capturing common student misconceptions, highlighting a gap between mathematical correctness and pedagogical realism. The study suggests distribution-based evaluation and error-focused in-context cues as promising directions for better aligning generated distractors with real student errors, with implications for scalable, teacher-assisted MCQ authoring and assessment design.
Abstract
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and are a reliable format for assessments and practice. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor- and time-intensive process for teachers and learning content designers, which has limited scalability. In this work, we study the task of automated distractor generation in the domain of math MCQs and explore a wide variety of large language model (LLM)-based approaches, from in-context learning to fine-tuning. We conduct extensive experiments using a real-world math MCQ dataset and find that although LLMs can generate some mathematically valid distractors, they are less adept at anticipating common errors or misconceptions among real students.
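To make the kNN-based in-context learning idea from the TL;DR concrete, the following is a minimal sketch, not the authors' implementation. It assumes a bag-of-words cosine similarity as a stand-in for whatever text encoder the paper actually uses, and the names `TOY_POOL`, `select_exemplars`, and `build_prompt` are hypothetical. The pool items are invented toy examples, not data from the paper's dataset. The sketch selects the k questions most similar to the target and formats them as few-shot exemplars in a prompt that would then be sent to an LLM.

```python
import math
from collections import Counter

# Invented toy exemplar pool (NOT from the paper's dataset).
TOY_POOL = [
    {"question": "What is 1/2 + 1/3 ?", "answer": "5/6",
     "distractors": ["2/5", "1/5", "2/6"]},
    {"question": "Solve for x: 2x + 3 = 11", "answer": "4",
     "distractors": ["7", "16", "5.5"]},
    {"question": "What is 3/4 + 1/8 ?", "answer": "7/8",
     "distractors": ["4/12", "4/8", "3/32"]},
]


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts.

    Assumption: a simple stand-in for a learned text encoder.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_exemplars(target_question, pool, k=2):
    """Return the k pool items whose question text is most similar to the target."""
    target_vec = bag_of_words(target_question)
    ranked = sorted(
        pool,
        key=lambda ex: cosine(target_vec, bag_of_words(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(target_question, answer, exemplars):
    """Format the exemplars plus the target question as a few-shot prompt."""
    parts = [
        f"Question: {ex['question']}\n"
        f"Answer: {ex['answer']}\n"
        f"Distractors: {', '.join(ex['distractors'])}"
        for ex in exemplars
    ]
    # The target question ends with an open "Distractors:" slot for the LLM.
    parts.append(
        f"Question: {target_question}\nAnswer: {answer}\nDistractors:"
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    target = "What is 2/3 + 1/6 ?"
    chosen = select_exemplars(target, TOY_POOL, k=2)
    print(build_prompt(target, "5/6", chosen))
```

With this toy pool, a fraction-addition target retrieves the two fraction-addition exemplars rather than the linear-equation one. A real system would likely swap the bag-of-words similarity for a sentence-embedding model, since surface token overlap is a weak proxy for shared misconceptions.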
