Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries

Adam Yang, Chen Chen, Konstantinos Pitas

TL;DR

The paper addresses uncertainty estimation for closed-source LLMs that do not disclose internal logits, proposing a practical approach based on querying multiple rephrasings of a base question to gauge answer consistency. It shows that simple rephrasings—notably synonym substitutions (reword) and expanded questions (expansion)—substantially improve calibration for top-1 decoding and can approach the calibration level achieved when access to last-layer logits is available. A theoretical framework links the rephrasing-affected uncertainty to the final-layer distribution via a logistic-noise model, with empirical validation demonstrating calibration gains and compatibility with white-box uncertainty. When extended to top-k decoding, rephrasings temper the top-class probability, further enhancing calibration across several datasets and models, offering a practical, model-agnostic tool for uncertainty estimation in real-world, black-box LLM deployments. The work also situates itself among prior uncertainty estimation methods, highlighting the practicality and adaptability of rephrasing strategies as a calibration mechanism for critical decision-making with closed-source models.

Abstract

State-of-the-art large language models are sometimes distributed as open-source software but are also increasingly provided as a closed-source service. These closed-source large language models typically see the widest usage by the public; however, they often do not provide an estimate of their uncertainty when responding to queries. As even the best models are prone to "hallucinating" false information with high confidence, the lack of a reliable estimate of uncertainty limits the applicability of these models in critical settings. We explore estimating the uncertainty of closed-source LLMs via multiple rephrasings of an original base query. Specifically, we ask the model multiple rephrased questions and use the similarity of the answers as an estimate of uncertainty. We diverge from previous work in (i) providing rules for rephrasing that are simple to memorize and use in practice and (ii) proposing a theoretical framework for why multiple rephrased queries obtain calibrated uncertainty estimates. Our method demonstrates significant improvements in the calibration of uncertainty estimates compared to the baseline and provides intuition as to how query strategies should be designed for optimal test calibration.
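
As a rough illustration of the idea in the abstract, the sketch below turns the answers to several rephrased queries into a confidence score by measuring how often the majority answer appears. The function name `consistency_confidence`, the exact-match comparison after light normalization, and the sample answers are all assumptions made for this example; the paper's own similarity measure and rephrasing prompts may differ.

```python
from collections import Counter


def consistency_confidence(answers):
    """Return (majority_answer, fraction of answers that agree with it)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Light normalization so "Athens" and "athens " count as the same answer.
    # (Assumption: exact match after normalization stands in for "similarity".)
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Hypothetical answers returned for three rephrasings of one base query.
answers = ["Athens", "Paris", "athens "]
prediction, confidence = consistency_confidence(answers)
print(prediction, round(confidence, 3))
```

A single query would correspond to a list of length one and therefore a confidence of 1.0, which is the naive baseline the rephrasing approach is compared against.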

Paper Structure

This paper contains 10 sections, 4 theorems, 15 equations, 4 figures, and 13 tables.

Key Result

Proposition 3.1

Let $f : \mathcal{X} \rightarrow \mathcal{Y}$ be an LLM, let $\boldsymbol{x}$ be a base query, and let $\mathcal{T}(\boldsymbol{x})\sim \tau$ be some randomized transformation of the base query. Let $p_A(\boldsymbol{x})$ be the probability of sampling the most probable answer $A \in \mathcal{Y}$ under transformations $\mathcal{T}(\boldsymbol{x})\sim \tau$. Let $\boldsymbol{z}_{mean}+\epsilon_{rephrase}$ be the latent represent

Figures (4)

  • Figure 1: Multiple rephrased queries for uncertainty estimation. Top row: Querying a closed-source LLM only once with a base query may yield an incorrect top-1 prediction. In the absence of additional information, the naive baseline is to assign $100\%$ confidence to this singular prediction. Bottom row: Querying the model multiple times with rephrased versions of the base query produces the $\{\mathrm{Athens}\}$ class twice and the $\{\mathrm{Paris}\}$ class once. This is roughly equivalent to $66.6\%$ confidence. This observation should serve as an alert to a potential error, even when the true label is unknown.
  • Figure 2: The behavior of the Accuracy, ECE, TACE, Brier, and AUROC for all datasets, architectures, and expansion methods, as we increase the number of samples. We plot the average value as well as confidence intervals $\pm2\sigma$. We see that the ECE and the AUROC improve with more samples while the accuracy drops slightly. This might be because the meaning of some queries is completely destroyed by our rephrasings. The Brier score captures this tradeoff by having a minimum at approximately 5 samples. The TACE remains relatively stable with respect to the number of samples.
  • Figure 3: We plot the distribution of $p_A(\boldsymbol{x})$ for the case of top-k decoding with and without rephrasing, for all datasets, models, and rephrasing methods. We see that rephrasing primarily acts to temper the probability of the most probable class $A$, thus making the model less confident and possibly better calibrated. We also plot the logistic (blue), and empirical cdf (red) for $\boldsymbol{\mathrm{w}}^{\top} \epsilon_{rephrase}\sim \rho$ for Mistral-7B, ARC-Challenge, and the "expansion" rephrasing method for top-1 decoding. $\rho$ is often close to a logistic distribution.
  • Figure 4: We plot the AUROC averaged over all models for each dataset and for each uncertainty estimation method. We observe that top-k improves over the naive top-1 decoding. Furthermore, the best rephrasing method (denoted as rephrase*) improves the AUROC significantly in all cases.
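
The calibration metrics named in the captions can be made concrete with a small sketch. The code below computes a binary (top-label) Brier score and an equal-width-bin ECE from per-question confidences and 0/1 correctness labels. The bin count, the binary form of the Brier score, and the toy numbers are assumptions for illustration and are not taken from the paper's experimental setup.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and 0/1 correctness."""
    pairs = list(zip(confidences, correct))
    return sum((c - y) ** 2 for c, y in pairs) / len(pairs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Toy data (hypothetical): four questions, the second one answered incorrectly.
correct = [1, 0, 1, 1]
naive_conf = [1.0, 1.0, 1.0, 1.0]        # single query, 100% confidence each
rephrase_conf = [1.0, 2 / 3, 1.0, 1.0]   # disagreement flags the wrong answer

naive_ece = expected_calibration_error(naive_conf, correct)
rephrase_ece = expected_calibration_error(rephrase_conf, correct)
print(naive_ece, rephrase_ece)
print(brier_score(naive_conf, correct), brier_score(rephrase_conf, correct))
```

In this toy case the naive baseline is overconfident on the one wrong answer, so lowering that single confidence to two thirds reduces both metrics. It says nothing about the size of the effect on real data.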

Theorems & Definitions (6)

  • Proposition 3.1
  • Proposition 4.1
  • Proposition C.1
  • proof
  • Proposition C.2
  • proof