Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

Zaifu Zhan, Mengyuan Cui, Rui Zhang

Abstract

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning. Meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by asking LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and that its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior than as a standalone solution for improving medical QA reliability.
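
As an informal reading of the setup described above, the reflection loop might be sketched as follows. This is a minimal sketch, assuming a generic ask_llm() chat-completion helper; the prompt wording, the SATISFIED convention, and the 10-step cap (consistent with the 0–10 range reported in Figure 5) are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the iterative self-reflection loop described in the paper.
# Assumptions (not from the source): ask_llm() wraps a chat-completion call to
# GPT-4o / GPT-4o-mini; prompt wording and the SATISFIED marker are illustrative.

MAX_REFLECTIONS = 10  # Figure 5 reports reflection-step counts in the range 0-10


def ask_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., to GPT-4o)."""
    raise NotImplementedError


def self_reflective_answer(question: str, options: str) -> str:
    # Step 0: standard chain-of-thought pass producing a rationale + answer.
    rationale = ask_llm(
        f"Question: {question}\nOptions: {options}\n"
        "Reason step by step, then state the final option (A/B/C/D)."
    )
    for _ in range(MAX_REFLECTIONS):
        # Reviewer pass: critique the previous rationale for medical or
        # logical errors; reply SATISFIED if no error is found.
        review = ask_llm(
            f"Question: {question}\nOptions: {options}\n"
            f"Previous reasoning:\n{rationale}\n"
            "Check this reasoning for medical or logical errors. "
            "Reply SATISFIED if it is correct; otherwise describe the errors."
        )
        if "SATISFIED" in review:
            break  # converged: keep the current answer
        # Revision pass: rewrite the rationale and update the answer.
        rationale = ask_llm(
            f"Question: {question}\nOptions: {options}\n"
            f"Previous reasoning:\n{rationale}\nCritique:\n{review}\n"
            "Revise the reasoning and state the final option (A/B/C/D)."
        )
    return rationale  # final rationale-answer pair after reflection
```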

Paper Structure

This paper contains 26 sections, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Overview of our self-reflective reasoning framework for medical multiple-choice question answering. Top: standard chain-of-thought (CoT) prompting, where the LLM produces an explicit rationale and a final option (A/B/C/D). Bottom: our self-reflective reasoning loop, which first generates an initial rationale–answer pair and then iteratively reviews the previous reasoning for medical or logical errors. If the reviewer is satisfied, the current answer is returned; otherwise, the model revises its rationale and updates the answer, repeating until convergence or a maximum number of reflection steps.
  • Figure 2: Example prompt used for the MedQA dataset. The prompt instructs the model to generate a step-by-step clinical rationale followed by a final multiple-choice answer, and is extended with a structured self-reflection instruction for iterative reasoning analysis (a hedged sketch of such a prompt follows this list).
  • Figure 3: Comparison of Chain-of-Thought reasoning and self-reflective reasoning accuracy on three medical QA datasets (HeadQA, MedQA, and PubMedQA). Results are reported for ChatGPT-4o and ChatGPT-4o-mini.
  • Figure 4: Accuracy as a function of the number of cumulative self-reflection steps for ChatGPT-4o and ChatGPT-4o-mini across three datasets. Each curve shows how performance evolves as additional reflection steps are applied.
  • Figure 5: Distribution of the number of self-reflection steps used by the models across all evaluation instances. Each pie chart shows the percentage of samples requiring a given number of reflection steps (0–10) for a specific dataset–model pair. Colors are consistent across charts to facilitate comparison.
  • ...and 1 more figure
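
To make the prompt described for Figure 2 concrete, here is a hypothetical template in the same spirit. The authors' exact wording is not reproduced in this excerpt, so all strings below (COT_PROMPT, REFLECTION_SUFFIX, and the placeholder fields) are assumptions for illustration only.

```python
# Hypothetical MedQA-style prompt in the spirit of Figure 2; the wording is
# an assumption, not the authors' actual prompt.
COT_PROMPT = """You are a medical expert answering a USMLE-style question.

Question: {question}
Options:
A. {option_a}
B. {option_b}
C. {option_c}
D. {option_d}

Provide a step-by-step clinical rationale, then give the final answer as a
single letter (A/B/C/D) on the last line."""

# Structured self-reflection instruction appended on each reflection step.
REFLECTION_SUFFIX = """
Now review your previous rationale for medical or logical errors.
If you find an error, revise the rationale and update the answer;
otherwise, confirm the current answer."""
```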