Calibrating Verbalized Probabilities for Large Language Models

Cheng Wang, Gyuri Szarvas, Georges Balazs, Pavel Danchenko, Patrick Ernst

TL;DR

This paper theoretically and empirically identifies the issue of re-softmax arising from the scaling of verbalized probabilities, and proposes using the invert softmax trick to approximate the "logit" by inverting verbalized probabilities.

Abstract

Calibrating verbalized probabilities presents a novel approach for reliably assessing and leveraging outputs from black-box Large Language Models (LLMs). Recent methods have demonstrated improved calibration by applying techniques like Platt scaling or temperature scaling to the confidence scores generated by LLMs. In this paper, we explore the calibration of verbalized probability distributions for discriminative tasks. First, we investigate the capability of LLMs to generate probability distributions over categorical labels. We theoretically and empirically identify the issue of re-softmax arising from the scaling of verbalized probabilities, and propose using the invert softmax trick to approximate the "logit" by inverting verbalized probabilities. Through extensive evaluation on three public datasets, we demonstrate: (1) the robust capability of LLMs in generating class distributions, and (2) the effectiveness of the invert softmax trick in estimating logits, which, in turn, facilitates post-calibration adjustments.
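The re-softmax issue and the invert softmax trick described in the abstract can be illustrated with a minimal sketch. It assumes the inversion takes the standard form $z_i = \log p_i + c$ (the inverse of softmax up to an additive constant $c$); the function names, the `eps` clipping for zero probabilities, and the example values of `p` and `tau` are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(z, tau=1.0):
    """Temperature-scaled softmax; subtracts the max for numerical stability."""
    scaled = [v / tau for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def invert_softmax(p, c=0.0, eps=1e-12):
    """Estimate logits from probabilities as log(p_i) + c.

    Assumption: eps clipping avoids log(0); c is an arbitrary constant
    because softmax is invariant to adding a constant to every logit.
    """
    return [math.log(max(pi, eps)) + c for pi in p]

# Hypothetical verbalized distribution over three classes.
p = [0.9, 0.05, 0.05]

# Re-softmax: treating probabilities as if they were logits flattens them.
resoftmaxed = softmax(p)

# Invert softmax first, then apply temperature scaling to the estimated logits.
logits = invert_softmax(p)
recovered = softmax(logits)             # tau = 1 reproduces p
calibrated = softmax(logits, tau=2.0)   # tau > 1 softens the distribution

print("re-softmaxed:", [round(v, 3) for v in resoftmaxed])
print("recovered:   ", [round(v, 3) for v in recovered])
print("calibrated:  ", [round(v, 3) for v in calibrated])
```

With $\tau = 1$ the estimated logits map back to the original distribution, whereas applying softmax directly to `p` compresses the top probability toward uniform. In practice $\tau$ would be tuned on a validation set rather than fixed by hand.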

Paper Structure

This paper contains 21 sections, 2 theorems, 9 equations, 11 figures, 9 tables.

Key Result

Proposition 1

Given a categorical probability distribution $\mathbf{p} = \{p_1,...,p_i,...,p_K\}$ over $K$ ($K>1$) classes, and a re-softmaxed probability distribution $\mathbf{q} = \textsc{softmax}(\mathbf{p}) = \{q_1,...,q_i,...,q_K\}$, for $i\in [1, K]$, the $q_i$ is bounded: $0<\frac{1}{K-1+e} \leq q_i \leq \
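The statement of Proposition 1 is cut off in this copy after the lower bound. The sketch below numerically checks the lower bound $\frac{1}{K-1+e}$ given above, together with an upper bound of $\frac{e}{K-1+e}$. The upper bound is an assumption here: it comes from the extreme case $p_i = 1$ with all other entries $0$, and is not copied from the source. The helper names are hypothetical.

```python
import math
import random

def softmax(z):
    """Plain softmax with max-subtraction for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def resoftmax_bounds(K):
    """Bounds on q_i = softmax(p)_i when p is a distribution over K classes.

    Lower bound is the one stated in the proposition; the upper bound is
    our own derivation from the one-hot case (an assumption).
    """
    denom = K - 1 + math.e
    return 1.0 / denom, math.e / denom

def random_distribution(K, rng):
    """Draw a random point on the probability simplex by normalising."""
    weights = [rng.random() + 1e-9 for _ in range(K)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
violations = 0
for K in (2, 3, 10):
    lo, hi = resoftmax_bounds(K)
    for _ in range(1000):
        q = softmax(random_distribution(K, rng))
        violations += sum(1 for qi in q if qi < lo - 1e-12 or qi > hi + 1e-12)

# The one-hot distribution attains both bounds at once.
K = 3
lo, hi = resoftmax_bounds(K)
q_one_hot = softmax([1.0] + [0.0] * (K - 1))
print("violations:", violations)
print("one-hot q:", q_one_hot, "bounds:", (lo, hi))
```

Under these assumptions the re-softmaxed probabilities stay in a narrow band that shrinks toward $1/K$ as $K$ grows, which is why scaling verbalized probabilities directly cannot reach confident values.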

Figures (11)

  • Figure 1: Our proposed method for calibrating verbalized probabilities, based on the invert softmax trick. The LLM-generated probabilities are inverted to estimate the logits, and post-hoc calibration (temperature scaling) is then applied to obtain calibrated probabilities.
  • Figure 2: Probability histograms for (a) LLM-generated verbalized probability, (b) re-softmaxed probability, (c) TS on the re-softmaxed probability. The red dashed and gray dashed lines represent Accuracy and Average Confidence, respectively. The distance between them represents miscalibration, i.e. the gap to perfect calibration. Steps (b) and (c) are internal stages of TS.
  • Figure 3: (a) The success rate of generating a probability distribution and (b) the mean and variance of the sum of the probabilities, for different token temperatures $T$.
  • Figure 4: Reliability diagrams and confidence histograms for the test sets of IMDB (top row), Emotion (middle row) and Amazon massive (bottom row). The $\tau$ is tuned with TS on the corresponding validation set. TS pushes the average confidence closer to the accuracy. The results are based on Claude-v2 with $t$=0.0. See the Appendix for the plots for Claude-v2 with $t$=1.0, Mixtral and Claude-v3.
  • Figure 5: Comparison of probabilities verbalized with 1 and 2 decimal places on the IMDB test set (positive class).
  • ...and 6 more figures
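
The captions for Figures 2 and 4 describe miscalibration as the gap between accuracy and average confidence, shown through reliability diagrams. A minimal sketch of how those quantities could be computed follows; the equal-width binning, the function names, and the toy data are assumptions for illustration and are not results from the paper.

```python
def calibration_gap(confidences, correct):
    """Return (accuracy, average confidence, absolute gap between them)."""
    accuracy = sum(correct) / len(correct)
    avg_conf = sum(confidences) / len(confidences)
    return accuracy, avg_conf, abs(accuracy - avg_conf)

def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one (bin_index, count, mean_confidence, accuracy) row per
    non-empty bin, i.e. the points a reliability diagram would plot.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, hit))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        acc = sum(h for _, h in items) / len(items)
        rows.append((idx, len(items), mean_conf, acc))
    return rows

# Toy data (hypothetical): top-class confidence and whether the prediction was right.
confidences = [0.95, 0.92, 0.88, 0.75, 0.55]
correct = [1, 1, 0, 1, 0]

accuracy, avg_conf, gap = calibration_gap(confidences, correct)
rows = reliability_bins(confidences, correct)
print(f"accuracy={accuracy:.2f} avg_conf={avg_conf:.2f} gap={gap:.2f}")
for row in rows:
    print(row)
```

In this toy example the average confidence exceeds the accuracy, which corresponds to the overconfident case that temperature scaling with $\tau > 1$ is meant to correct.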

Theorems & Definitions (5)

  • Definition 1
  • Proposition 1
  • Proposition 2
  • Proof 1
  • Proof 2