Calibration in Deep Learning: A Survey of the State-of-the-Art

Cheng Wang

TL;DR

This survey categorizes and analyzes calibration methods for deep learning, explaining why high-performing models remain poorly calibrated and how post-hoc methods, training-time regularization, uncertainty estimation, and hybrid approaches can improve reliability. It covers metrics such as Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier score, and surveys methods ranging from temperature scaling (TS) and Dirichlet calibration to DAC and MMCE, with special attention to large models and LLMs. It also discusses practical considerations, domain applications, and open issues such as calibration under distribution shift, data bias, and generative-model calibration, giving a roadmap for robust, trustworthy calibration in real-world AI systems. The work highlights the trade-offs between calibration quality, computational cost, and deployment constraints, and offers guidance for choosing and combining techniques in different settings.

Abstract

Calibrating deep neural models plays an important role in building reliable, robust AI systems for safety-critical applications. Recent work has shown that modern neural networks with high predictive capability are poorly calibrated and produce unreliable predictions. Although deep learning models achieve remarkable performance on various benchmarks, model calibration and reliability remain relatively under-explored. Ideal deep models should not only have high predictive performance but also be well calibrated. There have been some recent advances in calibrating deep models. In this survey, we review state-of-the-art calibration methods and the principles behind them. We start with the definition of model calibration and explain the root causes of miscalibration. We then introduce the key metrics for measuring calibration. This is followed by a summary of calibration methods, which we roughly classify into four categories: post-hoc calibration, regularization methods, uncertainty estimation, and composition methods. We also cover recent advances in calibrating large models, particularly large language models (LLMs). Finally, we discuss open issues, challenges, and potential directions.

Paper Structure

This paper contains 49 sections, 36 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Comparison of reliability diagrams and predictive probability between uncalibrated and calibrated binary classification models. (a) and (b) correspond to the uncalibrated model trained with standard cross-entropy loss; (c) and (d) show the calibrated model trained with focal loss ($\gamma=5$). Both models achieve similar accuracy (83.8% and 83.4%, respectively), but differ in calibration. Reliability diagrams (a) and (c) plot predicted confidence (x-axis) versus empirical accuracy (y-axis) using 10 bins. The dashed diagonal represents perfect calibration, while the Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) quantify calibration performance. Histograms (b) and (d) display the distribution of predicted probabilities on 1000 test samples. Pink bars indicate the calibration error (i.e., the absolute difference between average confidence and empirical accuracy) in each bin.
  • Figure 2: Illustration depicting factors influencing model calibration. Modern neural networks often exhibit overconfidence, which is closely associated with overfitting during model training. Overfitting is commonly exacerbated by inadequate regularization techniques and the presence of data bias within training datasets, particularly in over-parameterized models.
  • Figure 3: Categorization of calibration methods and representative approaches.
  • Figure 4: Reliability diagrams for a model trained on CIFAR100 with different numbers of bins (left to right: 20, 50, 100). The dashed diagonal represents perfect calibration, and the red bars show the gap to perfect calibration in each bin. The calibration error is sensitive to the number of bins.
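
The per-bin calibration error described in the Figure 1 caption (the absolute difference between average confidence and empirical accuracy in each bin) can be turned into ECE and MCE with a few lines of code. The following is a minimal sketch, not the survey's own implementation: the function name `calibration_errors` is hypothetical, and it assumes equal-width confidence bins on [0, 1], with ECE taken as the sample-weighted mean of the per-bin gaps and MCE as the largest gap.

```python
import numpy as np


def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins on [0, 1].

    confidences: predicted probability of the predicted class, one per sample.
    correct:     1 if the prediction was right, 0 otherwise.
    n_bins:      number of bins (assumption: equal-width binning).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    # right=True makes bins right-closed: (lo, hi].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between empirical accuracy and average confidence in this bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
        mce = max(mce, gap)
    return ece, mce


if __name__ == "__main__":
    conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
    hit = [1, 1, 1, 0, 1, 0]
    for bins in (10, 20, 50):
        print(bins, calibration_errors(conf, hit, n_bins=bins))
```

Because the result depends on `n_bins`, it is worth reporting the bin count alongside any ECE or MCE value, which is the sensitivity that Figure 4 illustrates.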