MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
Zonghai Yao, Aditya Parashar, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Zhichao Yang, Hong Yu
TL;DR
MCQG-SRefine is an iterative, self-refining framework that converts medical cases into high-quality USMLE-style MCQs, supported by expert-driven prompts and NBME-guided identification of topics and test points. The approach has four LLM-driven components: initialization with retrieval-based few-shot guidance, question-answer feedback, structured critique, and corrective refinement. The paper also introduces an LLM-as-Judge metric that evaluates question quality and difficulty without a costly expert review; the metric aligns strongly with expert judgments, and human evaluators prefer the generated questions over baseline outputs. Empirical results show that when topics and key points are expert-curated, MCQG-SRefine yields more challenging questions and better overall quality than GPT-4 alone, which suggests promise for scalable, domain-specific automated medical question generation. The work also discusses limitations, ethical considerations, and avenues for broader application and fairness-aware deployment in medical education tools.
Abstract
Automatic question generation (QG) is essential for AI and NLP, particularly in intelligent tutoring, dialogue systems, and fact verification. Generating multiple-choice questions (MCQG) for professional exams, like the United States Medical Licensing Examination (USMLE), is especially challenging, because high-quality questions require domain expertise and complex multi-hop reasoning. However, current large language models (LLMs) like GPT-4 struggle with professional MCQG due to outdated knowledge, hallucination issues, and prompt sensitivity, resulting in unsatisfactory quality and difficulty. To address these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique and Correction) framework for converting medical cases into high-quality USMLE-style questions. By integrating expert-driven prompt engineering with iterative self-critique and self-correction feedback, MCQG-SRefine significantly enhances human expert satisfaction regarding both the quality and difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process, ensuring reliable and expert-aligned assessments.
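
The critique-and-correction loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name self_refine, the three callables (generate, critique, correct), the stopping rule (stop when the critique returns no comments), and the max_iters cap are all hypothetical. In practice each callable would wrap an LLM prompt; the stubs here only simulate that behavior so the control flow can be run.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical; they illustrate the loop shape only.
Generate = Callable[[str], str]               # medical case -> initial draft question
Critique = Callable[[str, str], List[str]]    # (case, draft) -> list of critique comments
Correct = Callable[[str, str, List[str]], str]  # (case, draft, comments) -> revised draft


def self_refine(
    case: str,
    generate: Generate,
    critique: Critique,
    correct: Correct,
    max_iters: int = 3,
) -> Tuple[str, List[str]]:
    """Generate a draft, then alternate critique and correction.

    Stops early when the critique returns no comments (an assumed
    stopping rule), or after max_iters critique rounds.
    Returns the final draft and the full history of drafts.
    """
    draft = generate(case)
    history = [draft]
    for _ in range(max_iters):
        comments = critique(case, draft)
        if not comments:  # nothing left to fix
            break
        draft = correct(case, draft, comments)
        history.append(draft)
    return draft, history


# Stub callables standing in for LLM calls (assumptions for demonstration).
def stub_generate(case: str) -> str:
    return "Q"


def stub_critique(case: str, draft: str) -> List[str]:
    # Pretend the draft needs two rounds of revision, marked by '*'.
    return ["strengthen distractors"] if draft.count("*") < 2 else []


def stub_correct(case: str, draft: str, comments: List[str]) -> str:
    return draft + "*"


final_draft, drafts = self_refine(
    "example case text", stub_generate, stub_critique, stub_correct, max_iters=5
)
print(final_draft, len(drafts))
```

Passing the three steps in as callables keeps the loop independent of any particular model or prompt, so the same skeleton could host different critique rubrics or a judge-based stopping criterion.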
