A Survey of Calibration Process for Black-Box LLMs

Liangru Xie, Hui Liu, Jingying Zeng, Xianfeng Tang, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Qi He

TL;DR

This survey addresses the calibration gap for black-box LLMs by defining the Calibration Process as two interrelated steps, Confidence Estimation and Calibration, with $P(\text{Correct} \mid \text{Confidence} = v) = v$ characterizing well-calibrated confidence: among responses assigned confidence $v$, a fraction $v$ should be correct. It categorizes Confidence Estimation into Consistency, Self-Reflection, and Hybrid approaches, including proxy-model and cross-model strategies suitable for API-only interfaces. It then reviews Calibration methods and measurement techniques (post-processing such as histogram binning and isotonic regression, Bayesian and multi-calibration refinements, and error- and correlation-based metrics) that map confidence to correctness without access to model internals. Finally, it discusses applications, limitations, and future directions, highlighting benchmarks, bias mitigation, and long-form calibration as priorities.
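As a rough illustration of the Consistency family of Confidence Estimation mentioned above, the following minimal sketch scores an answer by how often repeated samples agree with it. It is not taken from the survey: the function name `consistency_confidence`, the strip-and-lowercase normalization, and the hard-coded sample answers are all assumptions made for illustration. In practice the samples would come from repeated API calls to the black-box model.

```python
from collections import Counter


def consistency_confidence(answers):
    """Return (majority_answer, agreement_fraction) for a list of sampled answers.

    Hypothetical helper: the confidence is simply the share of samples
    that match the most frequent (normalized) answer.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Assumed normalization: ignore case and surrounding whitespace.
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Stand-in for five responses sampled from an API-only model.
samples = ["Paris", "paris", "Lyon", "Paris ", "Paris"]
answer, confidence = consistency_confidence(samples)
print(answer, confidence)
```

Exact string matching only suits short-form answers. Free-form outputs would need a semantic similarity measure instead, which this sketch does not attempt.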

Abstract

Large Language Models (LLMs) demonstrate remarkable performance in semantic understanding and generation, yet accurately assessing the reliability of their output remains a significant challenge. While numerous studies have explored calibration techniques, they primarily focus on White-Box LLMs with accessible parameters. Black-Box LLMs, despite their superior performance, place heightened demands on calibration techniques because they can only be accessed through an API. Although recent research has achieved breakthroughs in black-box LLM calibration, a systematic survey of these methodologies is still lacking. To bridge this gap, we present the first comprehensive survey on calibration techniques for black-box LLMs. First, we define the Calibration Process of LLMs as comprising two interrelated key steps: Confidence Estimation and Calibration. Second, we conduct a systematic review of methods applicable in black-box settings and provide insights into the unique challenges and connections involved in implementing these key steps. Furthermore, we explore typical applications of the Calibration Process in black-box LLMs and outline promising future research directions, providing new perspectives for enhancing reliability and human-machine alignment. Our GitHub repository: https://github.com/LiangruXie/Calibration-Process-in-Black-Box-LLMs

Paper Structure

This paper contains 22 sections, 10 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Calibration Process. LLM responses first pass through the Confidence Estimation module to obtain confidence values, while correctness values are determined by the specific task. In the Calibration module, Calibration Methods reduce the calibration error by minimizing the gap between confidence and correctness values, and Measurement Methods then assess the actual calibration error. The primary objective of the entire calibration process is to obtain well-calibrated confidence values that accurately reflect response quality. The figure also shows that methods used for black-box LLMs can be fully incorporated into each module for white-box LLMs, whereas the application of white-box methods to black-box LLMs remains limited.
  • Figure 2: Organizational Structure of the Calibration Process for black-box LLMs.
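The Calibration module described in the Figure 1 caption can be sketched as follows, using histogram binning as the Calibration Method and expected calibration error (ECE) as the Measurement Method. This is a minimal sketch under stated assumptions, not the survey's implementation: the function names, the equal-width binning, the midpoint fallback for empty bins, and the toy confidence and correctness values are all illustrative.

```python
def bin_index(conf, n_bins):
    """Map a confidence in [0, 1] to an equal-width bin (1.0 goes in the last bin)."""
    return min(int(conf * n_bins), n_bins - 1)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over bins."""
    conf_sums = [0.0] * n_bins
    acc_sums = [0.0] * n_bins
    counts = [0] * n_bins
    for c, y in zip(confidences, correct):
        b = bin_index(c, n_bins)
        conf_sums[b] += c
        acc_sums[b] += y
        counts[b] += 1
    n = len(confidences)
    # (count_b / n) * |acc_b - conf_b| simplifies to |acc_sum_b - conf_sum_b| / n.
    return sum(abs(acc_sums[b] - conf_sums[b]) / n
               for b in range(n_bins) if counts[b])


def fit_histogram_binning(confidences, correct, n_bins=10):
    """Learn one calibrated value per bin: the observed accuracy in that bin."""
    hits = [0.0] * n_bins
    counts = [0] * n_bins
    for c, y in zip(confidences, correct):
        b = bin_index(c, n_bins)
        hits[b] += y
        counts[b] += 1
    # Assumption: empty bins fall back to the bin midpoint.
    return [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def apply_histogram_binning(bin_values, confidences):
    """Replace each raw confidence with its bin's calibrated value."""
    n_bins = len(bin_values)
    return [bin_values[bin_index(c, n_bins)] for c in confidences]


# Toy data: overconfident raw scores and task-determined correctness labels.
raw_conf = [0.95, 0.92, 0.97, 0.91, 0.55, 0.52, 0.58, 0.51]
correct = [1, 1, 0, 0, 1, 0, 0, 0]

ece_before = expected_calibration_error(raw_conf, correct)
bin_values = fit_histogram_binning(raw_conf, correct)
calibrated = apply_histogram_binning(bin_values, raw_conf)
ece_after = expected_calibration_error(calibrated, correct)
print(ece_before, ece_after)
```

Here the mapping is fitted and measured on the same samples, which gives an optimistic result. A realistic setup would fit the bins on a held-out calibration split and report ECE on separate test data.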