Full-ECE: A Metric For Token-level Calibration on Large Language Models
Han Liu, Yupeng Zhang, Bingning Wang, Weipeng Chen, Xiaolin Hu
TL;DR
Traditional calibration metrics are inadequate for token-level uncertainty in Large Language Models, whose vocabularies are very large. This work introduces full calibration and its metric, $Full\text{-}ECE$, which treats outputs as samples from the full predicted distribution and bins the predicted probabilities of all tokens across $[0,1]$: $Full\text{-}ECE=\sum_{m=1}^M \frac{|B^*_m|}{N} |A^*_m - C^*_m|$, where $|B^*_m|=\sum_{k=1}^K |B_{m,k}|$, $A^*_m=\frac{\sum_{k=1}^K \sum_{i\in B_{m,k}} \mathds{1}(y_i=k)}{|B^*_m|}$, and $C^*_m=\frac{\sum_{k=1}^K \sum_{i\in B_{m,k}} p_i^{(k)}}{|B^*_m|}$. Here $B_{m,k}$ is the set of samples whose predicted probability for class $k$ falls in bin $m$, so $A^*_m$ and $C^*_m$ are the empirical frequency and the mean predicted probability in bin $m$, pooled over all $K$ classes. By using Bayes' rule to relate $\mathbb{P}(y^*=y|p^*)$ to joint and marginal probabilities, the metric captures calibration across the entire distribution rather than only the top-1 class. Empirically, $Full\text{-}ECE$ is more robust to binning choices than traditional calibration metrics and improves during LLM training, making it a more reliable tool for evaluating and improving token-level uncertainty over large vocabularies.
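As an illustration of the formula above, the following is a minimal sketch in plain Python, not the authors' implementation. The names `full_ece`, `probs`, `labels`, and `num_bins` are hypothetical. It assumes $M$ equal-width bins with a probability of exactly 1 placed in the last bin, and it takes $N$ to be the number of token positions (rows of `probs`); neither convention is spelled out in the summary above.

```python
from typing import Sequence


def full_ece(probs: Sequence[Sequence[float]],
             labels: Sequence[int],
             num_bins: int = 10) -> float:
    """Sketch of Full-ECE: bin every (position, token) probability, not just the top-1.

    probs[i][k] is the predicted probability of token k at position i,
    labels[i] is the index of the observed token at position i.
    """
    n = len(probs)                 # N: number of token positions (assumption)
    count = [0] * num_bins         # |B*_m|: entries pooled over all K tokens
    hits = [0.0] * num_bins        # sum of 1(y_i = k) over entries in bin m
    conf = [0.0] * num_bins        # sum of p_i^(k) over entries in bin m

    for p_i, y_i in zip(probs, labels):
        for k, p in enumerate(p_i):
            # Equal-width binning (assumption); p == 1.0 goes to the last bin.
            m = min(int(p * num_bins), num_bins - 1)
            count[m] += 1
            hits[m] += 1.0 if k == y_i else 0.0
            conf[m] += p

    total = 0.0
    for m in range(num_bins):
        if count[m] == 0:
            continue               # empty bins contribute nothing
        a_m = hits[m] / count[m]   # A*_m: empirical frequency in bin m
        c_m = conf[m] / count[m]   # C*_m: mean predicted probability in bin m
        total += count[m] / n * abs(a_m - c_m)
    return total
```

Because every entry of the predicted distribution is binned, low-probability tokens contribute to the score alongside the top-1 token, which is the distinction from standard ECE that the summary draws.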
Abstract
Deep Neural Networks (DNNs) excel in various domains but struggle to provide accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs because of the models' vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.
