Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework

Yusong Ke, Hongru Lin, Yuting Ruan, Junya Tang, Li Li

TL;DR

This work addresses the reliability of medical question-answering by large language models and introduces an enhanced Conformal Prediction (CP) framework that provides guaranteed marginal coverage of at least $1 - \alpha$ for medical MCQA prediction sets. The method defines a Non-Conformity Score tied to option-frequency estimates and incorporates a self-consistency mechanism, plus a monotone loss for task-specific risk control. Evaluations on MedMCQA, MedQA, and MMLU with four LLMs show strict control of the miscoverage rate while reducing the Average Prediction Set Size (APSS) as the risk level increases, highlighting a practical uncertainty metric for LLMs. The contributions include the first CP application to medical MCQA, demonstration of APSS as an uncertainty indicator, and extensive ablations that support robustness and applicability to high-stakes medical contexts.

Abstract

Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) provides a statistically rigorous framework for marginal (average) coverage guarantees, but it has seen limited exploration in medical QA. This paper proposes an enhanced CP framework for medical multiple-choice question-answering (MCQA) tasks. By associating the nonconformity score with the frequency score of correct options and leveraging self-consistency, the framework addresses internal model opacity and incorporates a risk control strategy with a monotonic loss function. Evaluated on the MedMCQA, MedQA, and MMLU datasets using four off-the-shelf LLMs, the proposed method meets the specified error rate guarantees while reducing the average prediction set size as the risk level increases, offering a promising uncertainty evaluation metric for LLMs.
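To make the idea in the abstract concrete, the following is a minimal sketch of split conformal prediction with a frequency-based nonconformity score. It is an illustration under stated assumptions, not the paper's implementation: the score is assumed to be one minus the fraction of sampled answers that picked an option, the option labels A to D are assumed, and the helper names (`option_frequencies`, `calibrate`, `prediction_set`) and the toy data are hypothetical. The risk-control extension with a monotonic loss is not covered here.

```python
import math
from collections import Counter

OPTIONS = ("A", "B", "C", "D")  # assumed four-option MCQA format


def option_frequencies(samples, options=OPTIONS):
    """Fraction of sampled answers that chose each option (self-consistency)."""
    counts = Counter(samples)
    return {o: counts[o] / len(samples) for o in options}


def calibrate(cal_samples, cal_labels, alpha):
    """Return the conformal threshold computed on a labeled calibration set."""
    # Assumed nonconformity score: 1 - frequency of the correct option.
    scores = sorted(
        1.0 - option_frequencies(s)[y] for s, y in zip(cal_samples, cal_labels)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points: keep every option
    return scores[k - 1]


def prediction_set(samples, q_hat):
    """Keep every option whose nonconformity score is within the threshold."""
    freqs = option_frequencies(samples)
    return {o for o, f in freqs.items() if 1.0 - f <= q_hat}


# Toy calibration data: four sampled answers per question (hypothetical).
cal_samples = [
    ["A", "A", "A", "A"],
    ["B", "B", "B", "B"],
    ["C", "C", "C", "A"],
    ["D", "D", "D", "B"],
    ["A", "A", "B", "B"],
    ["C", "C", "D", "D"],
    ["A", "B", "B", "B"],
]
cal_labels = ["A", "B", "C", "D", "A", "D", "A"]

q_hat = calibrate(cal_samples, cal_labels, alpha=0.25)
confident = prediction_set(["A", "A", "B", "C"], q_hat)
uncertain = prediction_set(["A", "A", "B", "B"], q_hat)
```

Under exchangeability of calibration and test questions, a threshold chosen this way yields prediction sets that contain the correct option with probability at least $1 - \alpha$ on average. A larger $\alpha$ lowers the threshold, which is why set sizes shrink as the risk level increases.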

Paper Structure

This paper contains 23 sections, 8 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Adaptation of Conformal Prediction to Medical MCQA Tasks.
  • Figure 2: Empirical Miscoverage Rate (EMR) for the MedMCQA dataset across different risk levels ($\alpha$). The Llama models exhibit superior stability and lower error rates compared to the Qwen2.5 models.
  • Figure 3: EMR performance on the MedQA dataset, demonstrating the consistent reliability of the Llama models over a range of error rates, while the Qwen2.5 models show more variable behavior.
  • Figure 4: EMR results for the MMLU dataset across categories including high school biology, anatomy, clinical knowledge, and college medicine. The Llama models maintain better stability and accuracy, particularly in clinical knowledge tasks.
  • Figure 5: Reliability measurement using frequency-based metrics on the MedMCQA dataset, showing variability in EMR and APSS across different $\alpha$ values.
  • ...and 1 more figure
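The captions above report two metrics, EMR and APSS. As a rough sketch of how they could be computed from a batch of prediction sets, assuming EMR is the fraction of test questions whose set omits the correct option and APSS is the mean set size (the paper's exact formulas are not shown in this excerpt, and the function names and toy data are hypothetical):

```python
def empirical_miscoverage_rate(pred_sets, labels):
    """Fraction of questions whose prediction set misses the true option."""
    misses = sum(1 for s, y in zip(pred_sets, labels) if y not in s)
    return misses / len(labels)


def average_prediction_set_size(pred_sets):
    """Mean number of options kept per question."""
    return sum(len(s) for s in pred_sets) / len(pred_sets)


# Toy example: four questions with their prediction sets and true options.
pred_sets = [{"A"}, {"A", "B"}, {"C"}, {"B", "C", "D"}]
labels = ["A", "B", "D", "C"]

emr = empirical_miscoverage_rate(pred_sets, labels)
apss = average_prediction_set_size(pred_sets)
```

In this reading, a valid conformal procedure should keep EMR at or below the chosen $\alpha$ on average, while a smaller APSS indicates that the model is more certain about its answers.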