Conformal Prediction with Large Language Models for Multi-Choice Question Answering

Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, Andrew Beam

TL;DR

The paper tackles the challenge of providing reliable uncertainty quantification for large language models in multiple-choice question answering. It adapts conformal prediction to generate prediction sets with distribution-free coverage guarantees using a calibration set, applied to LLaMA-13B on the MMLU MCQA subset. Key findings show that conformal uncertainty correlates with accuracy, enabling selective classification, and that exchangeability between calibration and test data is crucial for guarantees, with reasonable transfer within similar domains. The work offers a practical, training-free uncertainty framework for safe deployment of LLMs in high-stakes settings and releases code and data for reproducibility.

Abstract

As large language models continue to be widely developed, robust uncertainty quantification techniques will become crucial for their safe deployment in high-stakes scenarios. In this work, we explore how conformal prediction can be used to provide uncertainty quantification in language models for the specific task of multiple-choice question-answering. We find that the uncertainty estimates from conformal prediction are tightly correlated with prediction accuracy. This observation can be useful for downstream applications such as selective classification and filtering out low-quality predictions. We also investigate the exchangeability assumption required by conformal prediction to out-of-subject questions, which may be a more realistic scenario for many practical applications. Our work contributes towards more trustworthy and reliable usage of large language models in safety-critical situations, where robust guarantees of error rate are required.
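For readers unfamiliar with the method, the standard split conformal construction behind the "robust guarantees of error rate" can be written as follows. This is a generic sketch: the symbols ($\hat{f}$ for the model's softmax output over answer choices, $n$ for the calibration set size, $\hat{q}$ for the threshold) and the particular score, one minus the softmax probability of the true answer, are assumptions made here for illustration, and the paper's own notation and score may differ.

```latex
% Conformal score on each of the n calibration questions (x_i, y_i)
s_i = 1 - \hat{f}(x_i)_{y_i}, \qquad i = 1, \dots, n

% Threshold: finite-sample corrected empirical quantile of the scores
\hat{q} = \mathrm{Quantile}\!\left( s_1, \dots, s_n ;\ \frac{\lceil (n+1)(1-\alpha) \rceil}{n} \right)

% Prediction set for a new question x
C(x) = \left\{ y : 1 - \hat{f}(x)_y \le \hat{q} \right\}

% Marginal coverage guarantee, valid when calibration and test points are exchangeable
1 - \alpha \ \le\ \mathbb{P}\left( y_{\mathrm{test}} \in C(x_{\mathrm{test}}) \right)
```

The guarantee is marginal (averaged over calibration and test draws) and relies on exchangeability, which is why calibrating on one subject and testing on another can lose coverage.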

Paper Structure

This paper contains 14 sections, 4 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: LLaMA MCQA accuracy is similar for GPT-4 generated questions and real MMLU questions across subjects. For most MMLU subjects, prediction accuracy using one-shot GPT-4 generated questions is similar to when actual MMLU questions are used in one-shot prompts. Results are averaged over ten randomly selected one-shot GPT-4 and MMLU prompts.
  • Figure 2: The accuracy distribution across subjects for ten prompts. We plot the distribution of accuracy for ten different one-shot prompts.
  • Figure 3: Desired coverage is achieved for all subjects. The red dashed line shows the desired coverage rate of $1-\alpha$ (here $\alpha=0.1$); conformal prediction guarantees that the prediction set contains the correct answer with probability at least $1-\alpha$. The colors denote the three categories of questions.
  • Figure 4: Uncertainty quantification using prediction set size. In conformal prediction, a set of predictions is generated for each question. The size of this set indicates how uncertain the model is for a particular question. Larger set sizes denote greater uncertainty, and smaller set sizes denote less uncertainty. The colors denote the three categories of questions.
  • Figure 5: Top-1 accuracy stratified by prediction set size. For all subjects, we find a strong correlation between the prediction uncertainty (as measured by set size) and the top-1 accuracy of those predictions. Conformal prediction can be used for selective classification by filtering those predictions in which the model is highly uncertain.
  • ...and 4 more figures
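
The quantities summarized in Figures 3 to 5 (coverage, prediction set size, and top-1 accuracy stratified by set size) can be illustrated with a minimal sketch of split conformal prediction. Everything here is an assumption made for illustration and is not the authors' released code: the function names are hypothetical, the score is one minus the softmax probability of the true answer, and the softmax outputs are synthetic random data rather than LLaMA outputs on MMLU. It requires NumPy 1.22 or later for the `method` argument of `np.quantile`.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the conformal threshold q_hat computed on a calibration set."""
    n = len(cal_labels)
    # Score: one minus the probability assigned to the true answer (assumed choice).
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, clipped to 1 for very small n.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, q_hat):
    """Boolean mask of shape (n_test, n_choices); True means the choice is in the set."""
    return (1.0 - test_probs) <= q_hat

def accuracy_by_set_size(test_probs, test_labels, sets):
    """Top-1 accuracy grouped by prediction set size (larger sets mean more uncertainty)."""
    sizes = sets.sum(axis=1)
    correct = test_probs.argmax(axis=1) == test_labels
    return {int(k): float(correct[sizes == k].mean()) for k in np.unique(sizes)}

# Synthetic stand-in for model outputs: random softmax vectors over four choices,
# with labels drawn from those same probabilities.
rng = np.random.default_rng(0)

def synthetic_split(n, n_choices=4):
    logits = rng.normal(scale=2.0, size=(n, n_choices))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(n_choices, p=p) for p in probs])
    return probs, labels

cal_probs, cal_labels = synthetic_split(2000)
test_probs, test_labels = synthetic_split(2000)

q_hat = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(test_probs, q_hat)

# Empirical coverage: fraction of test questions whose true answer is in the set.
coverage = float(sets[np.arange(len(test_labels)), test_labels].mean())
stratified = accuracy_by_set_size(test_probs, test_labels, sets)

print(f"threshold q_hat: {q_hat:.3f}")
print(f"empirical coverage: {coverage:.3f}")
print(f"mean set size: {sets.sum(axis=1).mean():.2f}")
print("top-1 accuracy by set size:", stratified)
```

Because calibration and test data come from the same synthetic distribution, the two splits are exchangeable and empirical coverage should land near the 0.9 target. For selective classification in this sketch, one would keep only the questions whose prediction set is small (for example size 1) and defer the rest.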