Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
Yusong Ke, Hongru Lin, Yuting Ruan, Junya Tang, Li Li
TL;DR
This work addresses the reliability of medical question answering by large language models and introduces an enhanced Conformal Prediction (CP) framework that guarantees marginal coverage of at least $1 - \alpha$ for medical MCQA prediction sets. The method defines a nonconformity score based on option-frequency estimates obtained through a self-consistency mechanism, and adds a monotone loss for task-specific risk control. Evaluations on MedMCQA, MedQA, and MMLU with four off-the-shelf LLMs show strict control of the miscoverage rate, with the Average Prediction Set Size (APSS) shrinking as the risk level increases, which suggests APSS as a practical uncertainty metric for LLMs. The contributions include the first CP application to medical MCQA, a demonstration of APSS as an uncertainty indicator, and extensive ablations that support robustness and applicability to high-stakes medical contexts.
Abstract
Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) provides a statistically rigorous framework for marginal (average) coverage guarantees but has been little explored in medical QA. This paper proposes an enhanced CP framework for medical multiple-choice question-answering (MCQA) tasks. By tying the nonconformity score to the frequency score of the correct option and leveraging self-consistency, the framework sidesteps the opacity of model internals; it also incorporates a risk control strategy with a monotonic loss function. Evaluated on the MedMCQA, MedQA, and MMLU datasets using four off-the-shelf LLMs, the proposed method meets the specified error rate guarantees while reducing the average prediction set size as the risk level increases, offering a promising uncertainty evaluation metric for LLMs.
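The pipeline summarized above can be illustrated with a minimal split-conformal sketch. It rests on assumptions that this section does not spell out: the nonconformity score is taken to be one minus the fraction of sampled answers that chose an option, the threshold is the standard finite-sample conformal quantile of the calibration scores, and all function names and the toy data are hypothetical. The monotone-loss risk control step is not covered. Under exchangeability of calibration and test questions, this construction gives prediction sets that contain the correct option with probability at least $1 - \alpha$.

```python
import math
from collections import Counter


def option_frequencies(samples, options):
    """Fraction of sampled answers that chose each option (self-consistency)."""
    counts = Counter(samples)
    return {opt: counts[opt] / len(samples) for opt in options}


def nonconformity(samples, options, option):
    """Assumed score: 1 minus the sampled frequency of `option`."""
    return 1.0 - option_frequencies(samples, options)[option]


def calibrate(cal_samples, cal_labels, options, alpha):
    """Return the conformal threshold from scores of the correct options."""
    scores = sorted(
        nonconformity(s, options, y) for s, y in zip(cal_samples, cal_labels)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points: keep every option
    return scores[rank - 1]


def prediction_set(samples, options, threshold):
    """Keep every option whose score does not exceed the threshold."""
    freqs = option_frequencies(samples, options)
    return [opt for opt in options if 1.0 - freqs[opt] <= threshold]


def average_set_size(sets):
    """Average Prediction Set Size (APSS) over a list of prediction sets."""
    return sum(len(s) for s in sets) / len(sets)


# Toy data (hypothetical): four sampled answers per question.
OPTIONS = ["A", "B", "C", "D"]
cal_samples = [list(s) for s in
               ["AAAA", "BBBB", "CCCC", "AAAB", "DDDC", "AABB", "CCDD", "ABBB"]]
cal_labels = ["A", "B", "C", "A", "D", "B", "C", "A"]

threshold = calibrate(cal_samples, cal_labels, OPTIONS, alpha=0.25)
test_sets = [
    prediction_set(list("BBCA"), OPTIONS, threshold),
    prediction_set(list("AACC"), OPTIONS, threshold),
]
print(threshold, test_sets, average_set_size(test_sets))
```

A larger $\alpha$ lowers the calibrated threshold, so fewer options pass the cut and the average set size shrinks, which matches the trend reported above. A question on which the sampled answers disagree yields a larger set, which is why set size can be read as an uncertainty signal.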
