MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback

Zonghai Yao, Aditya Parashar, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Zhichao Yang, Hong Yu

TL;DR

MCQG-SRefine presents an iterative, self-refining framework for converting medical cases into high-quality USMLE-style MCQs, augmented by expert-driven prompts and NBME-guided topic/test-point identification. Central to the approach are four components: initialization with retrieval-based few-shot guidance, question-answer feedback, structured critique, and corrective refinement, all driven by LLMs. It introduces an LLM-as-Judge metric that evaluates quality and difficulty automatically, without a full expert review, achieving strong alignment with expert judgments and stronger human preference than baselines. Empirical results show that when topics and key points are expert-curated, MCQG-SRefine yields more challenging questions and better overall quality than GPT-4 alone, indicating promise for scalable, domain-specific automated medical question generation. The work also discusses limitations, ethical considerations, and avenues for broader application and fairness-aware deployment in medical education tools.

Abstract

Automatic question generation (QG) is essential for AI and NLP, particularly in intelligent tutoring, dialogue systems, and fact verification. Generating multiple-choice questions (MCQG) for professional exams, like the United States Medical Licensing Examination (USMLE), is particularly challenging, requiring domain expertise and complex multi-hop reasoning for high-quality questions. However, current large language models (LLMs) like GPT-4 struggle with professional MCQG due to outdated knowledge, hallucination issues, and prompt sensitivity, resulting in unsatisfactory quality and difficulty. To address these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique and Correction) framework for converting medical cases into high-quality USMLE-style questions. By integrating expert-driven prompt engineering with iterative self-critique and self-correction feedback, MCQG-SRefine significantly enhances human expert satisfaction regarding both the quality and difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process, ensuring reliable and expert-aligned assessments.
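
The iterative critique-and-correction loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `MCQ` fields, the function names, and the `threshold` and `max_iters` values are hypothetical, and the toy rule-based callables stand in for the LLM calls the paper uses. The stopping rule follows the Figure 2 caption (stop once the feedback score exceeds a threshold); the separate answer step from that caption is omitted here for brevity.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one generated question."""
    context: str
    question: str
    answer: str
    distractors: Tuple[str, ...]


def self_refine(
    case: str,
    init: Callable[[str], MCQ],
    critique: Callable[[MCQ], Tuple[str, float]],
    correct: Callable[[MCQ, str], MCQ],
    threshold: float = 0.9,   # assumed value, not from the paper
    max_iters: int = 4,       # assumed value, not from the paper
) -> Tuple[MCQ, List[float]]:
    """Critique and correct a draft until its score exceeds the threshold."""
    mcq = init(case)
    scores: List[float] = []
    for _ in range(max_iters):
        feedback, score = critique(mcq)
        scores.append(score)
        if score > threshold:
            break
        mcq = correct(mcq, feedback)
    return mcq, scores


# Toy stand-ins for the LLM-backed steps (illustrative only).
def toy_init(case: str) -> MCQ:
    return MCQ(
        context=case,
        question="What is the most likely diagnosis?",
        answer="Sepsis",
        distractors=("Pneumonia", "Pyelonephritis", "Meningitis"),
    )


def toy_critique(mcq: MCQ) -> Tuple[str, float]:
    # Penalize a context that leaks the correct answer.
    if mcq.answer.lower() in mcq.context.lower():
        return "The context gives away the answer.", 0.5
    return "No issues found.", 1.0


def toy_correct(mcq: MCQ, feedback: str) -> MCQ:
    # Replace the leaked answer with a neutral phrase (case-insensitive).
    cleaned = re.sub(re.escape(mcq.answer), "an acute illness",
                     mcq.context, flags=re.IGNORECASE)
    return replace(mcq, context=cleaned)


final_mcq, score_history = self_refine(
    "A 67-year-old man is admitted with sepsis, fever, and hypotension.",
    toy_init, toy_critique, toy_correct,
)
```

Passing the three steps as callables keeps the loop independent of any particular model or prompt, so each step can be swapped for a real LLM call without changing the control flow.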

Paper Structure

This paper contains 33 sections, 11 figures, 19 tables, 2 algorithms.

Figures (11)

  • Figure 1: USMLE MCQ generated by GPT-4 and MCQG-SRefine. The GPT-4 question contains several errors and inconsistencies, such as extraneous information, a distractor option format mismatch, mentioning symptoms instead of conditions, and a context that contains the answer. MCQG-SRefine addresses these issues, resulting in a higher-quality question that aligns the context, question, and answer options more coherently. Irrelevant details are removed, the question focuses on the key clinical condition of sepsis, distractor options are presented in a consistent format, and the context no longer gives away the answer.
  • Figure 2: The framework for generating USMLE-style questions involves four main steps, as illustrated in the figure. First, the initialization step generates the context, question, answer, and distractor options using retrieval and generation models. Second, the generation model answers the generated question and gives its reasoning. Third, the feedback step evaluates the generated components against various rubrics and produces textual feedback and scores, stopping if the feedback scores exceed a threshold. Finally, the refine step uses the feedback to improve the components before cycling back to the answer step.
  • Figure 3: Expert quality preferences between the GPT-4 and the GPT-4 + MCQG-SRefine questions. The data is divided into Human and Machine according to how the topic $t$ and key points $k$ were generated. Only the final Expert X preferences are shown here; more results are provided in Appendix Table \ref{tab:evaluations}. The percentage agreement between experts is 87.5% (Human<$t$, $k$>: 90%, Machine<$t$, $k$>: 85%). Cohen's kappa between experts is 0.66722 (Human<$t$, $k$>: 0.75, Machine<$t$, $k$>: 0.57), indicating substantial reliability.
  • Figure 4: The difficulty expert evaluation for the GPT-4 generated and the GPT-4 + MCQG-SRefine questions.
  • Figure 5: LLM-as-Judge (Rating) results for different components (e.g., Context, Question, Correct Answer, Distractor, Reasoning) and the final score.
  • ...and 6 more figures
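
The two inter-expert statistics quoted in the Figure 3 caption, percentage agreement and Cohen's kappa, can be computed as in the sketch below. The label lists are made-up toy data, not the paper's annotations, and the function names are illustrative. Kappa corrects the observed agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two raters chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty item set")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters over the same items."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum(counts_a[lab] * counts_b[lab]
              for lab in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy preferences: "S" = prefers the refined question, "G" = prefers GPT-4.
expert_1 = ["S", "S", "G", "G"]
expert_2 = ["S", "S", "G", "S"]

agreement = percent_agreement(expert_1, expert_2)
kappa = cohens_kappa(expert_1, expert_2)
```

In this toy case the experts agree on 3 of 4 items, chance agreement is 0.5, and kappa is 0.5, which shows how kappa can sit well below the raw agreement rate.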