Conformal Information Pursuit for Interactively Guiding Large Language Models

Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal

TL;DR

This work tackles interactive prediction with large language models by replacing entropy-based uncertainty with conformal prediction-based uncertainty estimates to guide sequential querying. It introduces Conformal Information Pursuit (C-IP), which builds prediction sets with marginal coverage guarantees and uses their expected size to bound conditional entropy, enabling distribution-free, robust query selection. The approach is validated on the 20 Questions game and extended to interactive medical question answering (MediQ), where it achieves competitive predictive performance and interpretable query chains. The results highlight the practical value of conformal prediction for uncertainty quantification in interactive LLM workflows and point to future directions for theoretical guarantees and risk-aware control in sequential decision making.

Abstract

A significant use case of instruction-finetuned Large Language Models (LLMs) is to solve question-answering tasks interactively. In this setting, an LLM agent is tasked with making a prediction by sequentially querying relevant information from the user, as opposed to a single-turn conversation. This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or equivalently minimizes uncertainty. However, obtaining accurate estimates of mutual information or conditional entropy for LLMs is very difficult in practice due to over- or under-confident LLM probabilities, which leads to suboptimal query selection and predictive performance. To better estimate the uncertainty at each iteration, we propose Conformal Information Pursuit (C-IP), an alternative approach to sequential information gain based on conformal prediction sets. More specifically, C-IP leverages a relationship between prediction sets and conditional entropy at each iteration to estimate uncertainty based on the average size of conformal prediction sets. In contrast to conditional entropy, we find that conformal prediction sets are a distribution-free and robust method of measuring uncertainty. Experiments with 20 Questions show that C-IP obtains better predictive performance and shorter query-answer chains compared to previous approaches to IP and uncertainty-based chain-of-thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C-IP achieves competitive performance with direct single-turn prediction while offering greater interpretability.
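The idea of choosing queries by the average size of conformal prediction sets can be illustrated with a small toy sketch. This is not the authors' implementation: the nonconformity score `1 - p(label)`, the split-conformal threshold rule, the deterministic query answers, and all names (`conformal_threshold`, `prediction_set`, `select_query`, the toy labels and calibration scores) are assumptions made for illustration. In the paper the probabilities would come from an LLM rather than a hand-written table.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration points: the set must contain every label
        return math.inf
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, tau):
    """Labels whose (assumed) nonconformity score 1 - p(label) is at most tau."""
    return [y for y, p in probs.items() if 1.0 - p <= tau]

def posterior(probs, query, answer):
    """Condition label probabilities on a deterministic query answer."""
    kept = {y: p for y, p in probs.items() if query[y] == answer}
    total = sum(kept.values())
    return {y: p / total for y, p in kept.items()}

def expected_set_size(probs, query, tau):
    """Average prediction-set size after asking `query`, weighted by answer probability."""
    size = 0.0
    for answer in set(query.values()):
        p_answer = sum(p for y, p in probs.items() if query[y] == answer)
        if p_answer > 0:
            size += p_answer * len(prediction_set(posterior(probs, query, answer), tau))
    return size

def select_query(probs, queries, tau):
    """Greedy step: pick the query with the smallest expected prediction-set size."""
    return min(queries, key=lambda name: expected_set_size(probs, queries[name], tau))

# Toy data (hypothetical): uniform belief over four labels, two candidate queries.
probs = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "shark": 0.25}
queries = {
    "is_mammal": {"cat": "yes", "dog": "yes", "eagle": "no", "shark": "no"},
    "can_fly": {"cat": "no", "dog": "no", "eagle": "yes", "shark": "no"},
}
cal_scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]  # made-up calibration scores
tau = conformal_threshold(cal_scores, alpha=0.2)
best = select_query(probs, queries, tau)
print(tau, best)
```

In this toy setup, `is_mammal` leaves an expected set size of 2.0 while `can_fly` leaves 2.5, so the greedy step picks `is_mammal`. A full interactive loop would repeat this selection after each answer until a stopping rule is met.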

Paper Structure

This paper contains 48 sections, 1 theorem, 17 equations, 14 figures, 2 tables, 6 algorithms.

Key Result

Proposition 3.1

For $\alpha \in (0, 0.5)$, consider any prediction set function $\mathcal{C}_{\hat{\tau}}$ satisfying the marginal coverage guarantee. Let $\lambda_\alpha := h_b(\alpha) + \alpha \log | \mathcal{Y} | - (1 - \alpha_N) \log (1 - \alpha)$, where $h_b$ is the binary entropy function. For the true distribution $P_{\mathrm{data}}$, the conditional entropy is then bounded in terms of $\lambda_\alpha$ and the expected size of the prediction set.
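To get a feel for the constant $\lambda_\alpha$, the sketch below evaluates the formula above numerically. The function names are hypothetical, the natural logarithm is an assumption (the excerpt does not state the base), and $\alpha_N$ is not defined in this excerpt, so it is passed in as an explicit argument rather than derived.

```python
import math

def binary_entropy(alpha):
    """Binary entropy h_b(alpha) in nats (natural log is an assumption)."""
    return -alpha * math.log(alpha) - (1 - alpha) * math.log(1 - alpha)

def lambda_alpha(alpha, num_labels, alpha_n):
    """lambda_alpha = h_b(alpha) + alpha*log|Y| - (1 - alpha_N)*log(1 - alpha).

    alpha_n is supplied by the caller because the excerpt does not define it.
    """
    if not 0 < alpha < 0.5:
        raise ValueError("alpha must lie in (0, 0.5)")
    return (
        binary_entropy(alpha)
        + alpha * math.log(num_labels)
        - (1 - alpha_n) * math.log(1 - alpha)
    )

# Illustrative values only: alpha = 0.1, 20 labels, alpha_N set equal to alpha.
value = lambda_alpha(0.1, 20, 0.1)
print(round(value, 4))
```

With these illustrative inputs the constant is roughly 0.72 nats. The dominant terms for small $\alpha$ are the binary entropy and $\alpha \log |\mathcal{Y}|$, so the constant grows slowly with the number of labels.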

Figures (14)

  • Figure 1: Example of diagnosis via Patient and Doctor LLM interaction.
  • Figure 2: IP with calibrated (solid) versus uncalibrated (dashed) measures of uncertainty.
  • Figure 3: Evaluations of C-IP with a closed query set $\mathcal{Q}_{\text{closed}}$ (top row) and open query set $\mathcal{Q}_{\text{open}}$ (bottom row). Each curve shows the average performance, with the shaded area denoting the standard deviation. Left column: Performance on the 20 Questions task of C-IP and baselines with binary (dashed) and non-binary (solid) query answers using Llama-3.1-8b. Middle column: Uncertainty estimated by IP and C-IP, with each curve evaluating the uncertainty of the selected query at each iteration. Right column: Desired (dashed) and empirical (solid) coverage of C-IP. Hyperparameter choice is marked with different colors, with desired and empirical coverage as dashed and solid.
  • Figure 4: Comparison with Probability Calibration Baselines under open query set setting $\mathcal{Q}_{\text{open}}$ with binary (top) and free-text (bottom) query answers.
  • Figure 5: Top: Predictive Performance in Medical Interactive Question Answering on the MediQ dataset. The tasks are divided based on specialty. Desired coverage is $1-\alpha=0.8$ for IM and P and $1-\alpha=0.7$ for N. Bottom: Comparison of empirical and desired coverage for C-IP. Desired coverage (dashed curve) is shown for $1 - \alpha \in \{0.7, 0.8, 0.9\}$. Since the number of queries may depend on the datapoints, Empirical Coverage (solid curve) at iteration $k$ is evaluated for all test datapoints that have stopped at or before iteration $k$. Each curve is averaged over three splits, with the shaded area denoting their standard deviation.
  • ...and 9 more figures
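The Figure 5 caption describes empirical coverage at iteration $k$ as being evaluated over the test datapoints that have stopped at or before iteration $k$. A minimal sketch of that bookkeeping follows. The record layout (`stop_iter`, `label`, `pred_set`) and the choice to check the prediction set held at the stopping iteration are assumptions, and the records are toy values, not results from the paper.

```python
def empirical_coverage(records, k):
    """Fraction of datapoints stopped at or before iteration k whose set contains the label.

    Returns None when no datapoint has stopped by iteration k.
    """
    stopped = [r for r in records if r["stop_iter"] <= k]
    if not stopped:
        return None
    hits = sum(r["label"] in r["pred_set"] for r in stopped)
    return hits / len(stopped)

# Toy records (hypothetical): stopping iteration, true label, final prediction set.
records = [
    {"stop_iter": 2, "label": "A", "pred_set": {"A", "B"}},
    {"stop_iter": 3, "label": "C", "pred_set": {"B"}},
    {"stop_iter": 5, "label": "D", "pred_set": {"D"}},
]

for k in (1, 2, 3, 5):
    print(k, empirical_coverage(records, k))
```

Because the denominator only counts datapoints that have already stopped, the curve can move up or down as $k$ grows, which is why it is compared against the desired coverage level at each iteration.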

Theorems & Definitions (1)

  • Proposition 3.1: correia2024information, simplified