Full-ECE: A Metric For Token-level Calibration on Large Language Models

Han Liu, Yupeng Zhang, Bingning Wang, Weipeng Chen, Xiaolin Hu

TL;DR

This work addresses the inadequacy of traditional calibration metrics for token-level uncertainty in Large Language Models (LLMs) with vast vocabularies. It introduces full calibration and its metric $Full\text{-}ECE$, which treats each output as a sample from the full predicted distribution and bins the predicted probabilities of all tokens across $[0,1]$. The metric is defined as $Full\text{-}ECE=\sum_{m=1}^M \frac{|B^*_m|}{N} |A^*_m - C^*_m|$, where $|B^*_m|=\sum_{k=1}^K |B_{m,k}|$, $A^*_m=\frac{\sum_{k=1}^K \sum_{i\in B_{m,k}} \mathbb{1}(y_i=k)}{|B^*_m|}$, and $C^*_m=\frac{\sum_{k=1}^K \sum_{i\in B_{m,k}} p_i^{(k)}}{|B^*_m|}$. Here $B_{m,k}$ is the set of samples whose predicted probability $p_i^{(k)}$ for class $k$ falls into the $m$-th bin, so $A^*_m$ is the observed frequency and $C^*_m$ the mean predicted probability within the merged bin $B^*_m$. By using Bayes' rule to relate $\mathbb{P}(y^*=y|p^*)$ to joint and marginal probabilities, the metric captures calibration across the entire distribution rather than only the top-1 class. Empirically, $Full\text{-}ECE$ is more robust to the choice of binning and improves during LLM training, offering a more reliable tool for evaluating and improving token-level uncertainty over large vocabularies.
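
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `full_ece`, the equal-width bins, the choice to place a probability of exactly 1 in the last bin, and the reading of $N$ as the number of predictions (rows) are all assumptions made for this example.

```python
import numpy as np

def full_ece(probs, labels, n_bins=10):
    """Sketch of Full-ECE following the summary formula (hypothetical helper).

    probs  : (N, K) array, each row a predicted distribution over K classes/tokens
    labels : (N,) integer array of ground-truth class indices
    Assumption: N in the formula is the number of rows of `probs`.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n, _ = probs.shape

    # Indicator 1(y_i = k) for every (sample, class) pair.
    hits = np.zeros_like(probs)
    hits[np.arange(n), labels] = 1.0

    # Pool all classes: every (sample, class) probability is one entry.
    p = probs.ravel()
    h = hits.ravel()

    # Equal-width bins over [0, 1]; p == 1.0 is assigned to the last bin.
    bin_idx = np.minimum((p * n_bins).astype(int), n_bins - 1)

    counts = np.bincount(bin_idx, minlength=n_bins)                # |B*_m|
    conf_sum = np.bincount(bin_idx, weights=p, minlength=n_bins)   # sum of p_i^(k)
    hit_sum = np.bincount(bin_idx, weights=h, minlength=n_bins)    # sum of 1(y_i = k)

    nonempty = counts > 0
    a_star = hit_sum[nonempty] / counts[nonempty]    # A*_m
    c_star = conf_sum[nonempty] / counts[nonempty]   # C*_m
    weights = counts[nonempty] / n                   # |B*_m| / N
    return float(np.sum(weights * np.abs(a_star - c_star)))

# Usage: four predictions over a 4-class vocabulary.
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.45, 0.15, 0.15],
    [0.05, 0.05, 0.85, 0.05],
    [0.35, 0.35, 0.15, 0.15],
])
labels = np.array([0, 1, 2, 3])
print(full_ece(probs, labels, n_bins=10))
```

Under this reading of $N$, the bin weights $|B^*_m|/N$ sum to $K$ rather than 1, because every sample contributes one entry per class; dividing the result by $K$ would give a weighted average instead.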

Abstract

Deep Neural Networks (DNNs) excel in various domains but face challenges in providing accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs due to their vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.

Paper Structure

This paper contains 8 sections, 13 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: (a), (b), and (c) demonstrate how ECE, cw-ECE, and Full-ECE aggregate statistics per bin for a 4-class classification task. Red numbers represent maximum probability values, and blue dashed boxes indicate ground truth label categories. ECE considers only these red maximum values, cw-ECE computes ECE for each class individually, while Full-ECE enhances cw-ECE by combining bins across different classes for statistical analysis.
  • Figure 2: Full-ECE with varying training data
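
The contrast drawn in the Figure 1 caption can be made concrete with a sketch of the two baseline metrics. This assumes the commonly used definitions: ECE bins only the maximum probability of each prediction, and cw-ECE computes a binned error per class and averages over classes. The names `ece`, `cw_ece`, and `_bin_index`, and the equal-width binning, are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def _bin_index(p, n_bins):
    # Equal-width bins over [0, 1]; p == 1.0 is assigned to the last bin.
    return np.minimum((p * n_bins).astype(int), n_bins - 1)

def ece(probs, labels, n_bins=10):
    """Top-1 ECE: only the maximum probability of each row is binned."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n = probs.shape[0]
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    idx = _bin_index(conf, n_bins)
    conf_sum = np.bincount(idx, weights=conf, minlength=n_bins)
    acc_sum = np.bincount(idx, weights=correct, minlength=n_bins)
    # sum_m |B_m|/N * |acc_m - conf_m| simplifies to the line below.
    return float(np.abs(acc_sum - conf_sum).sum() / n)

def cw_ece(probs, labels, n_bins=10):
    """Classwise ECE: a separate binned error per class, averaged over classes."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n, k = probs.shape
    total = 0.0
    for c in range(k):
        p = probs[:, c]
        hit = (labels == c).astype(float)
        idx = _bin_index(p, n_bins)
        conf_sum = np.bincount(idx, weights=p, minlength=n_bins)
        hit_sum = np.bincount(idx, weights=hit, minlength=n_bins)
        total += np.abs(hit_sum - conf_sum).sum() / n
    return float(total / k)
```

With a vocabulary of tens of thousands of tokens, the per-class loop in `cw_ece` leaves most class-specific bins nearly empty, which is the situation that merging bins across classes is meant to address.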