Improving Multi-Class Calibration through Normalization-Aware Isotonic Techniques

Alon Arad, Saharon Rosset

TL;DR

The paper addresses multiclass calibration by extending isotonic regression to normalization-aware settings. It introduces two non-parametric approaches, NA-FIR and SCIR, that integrate probability normalization either directly in the optimization (NA-FIR) or through a cumulative, rank-aware formulation (SCIR). Empirical results across tasks show consistent improvements in NLL and conf-ECE, establishing normalization-aware isotonic methods as strong non-parametric alternatives to parametric calibrators in diverse domains. The work highlights practical trade-offs between calibration quality and computational scalability, offering a flexible toolkit for reliable probabilistic predictions in multiclass problems.
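For readers unfamiliar with the two metrics named in the summary, the following is a minimal sketch of how NLL and confidence ECE are commonly computed. It is not taken from the paper: the function names, the equal-width binning, and the default of 15 bins are assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class.

    probs:  list of per-sample probability vectors
    labels: list of integer class indices
    eps:    floor to avoid log(0) (illustrative choice)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))
    return total / len(labels)


def conf_ece(probs, labels, n_bins=15):
    """Confidence ECE using equal-width bins over the top predicted probability.

    n_bins=15 is an assumed default, not a value taken from the paper.
    """
    # Per bin: [sample count, sum of confidences, number of correct predictions]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)  # argmax class
        conf = p[pred]
        idx = min(int(conf * n_bins), n_bins - 1)     # conf == 1.0 goes to last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += 1.0 if pred == y else 0.0
    n = len(labels)
    # sum_b (count_b / n) * |accuracy_b - confidence_b|
    return sum(abs(c_sum - a_sum) / n for cnt, c_sum, a_sum in bins if cnt)
```

Lower is better for both quantities. NLL penalizes every class probability assigned to the true label, while confidence ECE only looks at the top predicted class.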

Abstract

Accurate and reliable probability predictions are essential for multi-class supervised learning tasks, where well-calibrated models enable rational decision-making. While isotonic regression has proven effective for binary calibration, its extension to multi-class problems via one-vs-rest calibration produced suboptimal results when compared to parametric methods, limiting its practical adoption. In this work, we propose novel isotonic normalization-aware techniques for multiclass calibration, grounded in natural and intuitive assumptions expected by practitioners. Unlike prior approaches, our methods inherently account for probability normalization by either incorporating normalization directly into the optimization process (NA-FIR) or modeling the problem as a cumulative bivariate isotonic regression (SCIR). Empirical evaluation on a variety of text and image classification datasets across different model architectures reveals that our approach consistently improves negative log-likelihood (NLL) and expected calibration error (ECE) metrics.
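To make the contrast in the abstract concrete, here is a minimal sketch of the one-vs-rest isotonic baseline: one isotonic fit per class via the pool-adjacent-violators algorithm (PAVA), followed by an after-the-fact renormalization. This is the baseline the abstract argues against, not the paper's NA-FIR or SCIR. All function names are hypothetical, and the step-function prediction and the uniform fallback are simplifying assumptions.

```python
from bisect import bisect_right


def pava(values, weights=None):
    """Pool Adjacent Violators: weighted least-squares non-decreasing fit."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block: [mean, total weight, number of pooled points]
    for v, wt in zip(values, weights):
        blocks.append([float(v), float(wt), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            merged_w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / merged_w, merged_w, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def fit_isotonic(scores, targets):
    """Fit a non-decreasing step function; tied scores are pooled first."""
    pooled = {}
    for s, t in zip(scores, targets):
        tot, cnt = pooled.get(s, (0.0, 0))
        pooled[s] = (tot + t, cnt + 1)
    xs = sorted(pooled)
    means = [pooled[x][0] / pooled[x][1] for x in xs]
    weights = [pooled[x][1] for x in xs]
    return xs, pava(means, weights)


def predict_isotonic(model, x):
    """Step-function lookup, clamped at both ends (no interpolation)."""
    xs, ys = model
    i = bisect_right(xs, x) - 1
    return ys[max(i, 0)]


def fit_ovr(probs, labels, n_classes):
    """One isotonic model per class: class-j score vs. indicator(label == j)."""
    return [
        fit_isotonic([p[j] for p in probs],
                     [1.0 if y == j else 0.0 for y in labels])
        for j in range(n_classes)
    ]


def predict_ovr(models, p):
    """Calibrate each class independently, then renormalize to sum to 1."""
    raw = [predict_isotonic(m, p[j]) for j, m in enumerate(models)]
    total = sum(raw)
    if total == 0.0:  # assumed fallback when every class maps to zero
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]
```

Because each class is fitted on its own, the calibrated values need not sum to one, and the renormalization step changes them after the fit. Handling normalization inside the fit, rather than afterwards, is what the abstract describes for NA-FIR and SCIR.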

Paper Structure

This paper contains 40 sections, 9 theorems, 61 equations, 14 figures, 5 tables, 4 algorithms.

Key Result

Proposition 4.1

The cumulative sorted problem defined in cu-sorted-problem can be solved in $O(m^2 k^4)$ worst-case time using alg:maximal-upper-set.

Figures (14)

  • Figure 1: Confidence reliability plots for the NG20 BERT-Large-Uncased classifier. The p-value is calculated as suggested by Vaicenavicius et al. (2019) under the null hypothesis of a perfectly calibrated procedure. The uncalibrated model is highly over-confident, and our two proposed methods are the only ones that obtain a positive p-value.
  • Figure 2: Both plots are based on the NG20 dataset with a trained BERT-Large-Uncased classifier. The left plot shows the fitted calibration curves of the cumulative models trained for the first 5 ranks, where the final prediction for each rank (cumulative class) is calculated as the difference between its cumulative value and the previous one. The right plot shows the effect on test predicted probabilities, comparing NA-FIR and FIR predictions as a function of the uncalibrated predictions, which were thresholded to lie within the $[0.01, 1]$ range and then binned into 30 equal-width intervals. The lower chart shows the corresponding bin sizes.
  • Figure 3: Comparison of conf-ECE and NLL between different calibration methods. Each cell in the heatmap represents the percentage of times the calibration method in the row achieved a better score than the calibration method in the column.
  • Figure 4: A simple example of a block split with $block\_size\_split\_threshold=2$, $min\_blocks=6$, and the original PAVA solution of block-structure.
  • Figure 5: Fit time in seconds for dynamic grid partition vs. min-cut partition.
  • ...and 9 more figures
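
Two computations described in the captions above can be sketched in a few lines. The first recovers per-rank predictions from cumulative values by differencing (Figure 2), and the second builds the pairwise "percentage of times better" table behind the heatmap (Figure 3). The function names, the zero starting value for the first rank, the lower-is-better convention, and the treatment of ties as no win are all assumptions for illustration, not details confirmed by the paper.

```python
def cumulative_to_rank_probs(cumulative):
    """Per-rank probability = cumulative value minus the previous cumulative value.

    The first rank is differenced against 0.0 (an assumption).
    """
    out, prev = [], 0.0
    for c in cumulative:
        out.append(c - prev)
        prev = c
    return out


def win_rate_matrix(scores):
    """Pairwise win percentages between calibration methods.

    scores: dict mapping a method name to a list of metric values, one per
            experiment, aligned across methods. Lower is assumed to be better
            (as for NLL and conf-ECE); ties count as no win.
    Returns pct[row][col] = percentage of experiments where row beats col.
    """
    methods = list(scores)
    pct = {}
    for row in methods:
        pct[row] = {}
        for col in methods:
            if row == col:
                continue
            wins = sum(a < b for a, b in zip(scores[row], scores[col]))
            pct[row][col] = 100.0 * wins / len(scores[row])
    return pct
```

For example, with two hypothetical methods scored on four experiments, `win_rate_matrix({"A": [0.1, 0.2, 0.3, 0.4], "B": [0.2, 0.1, 0.4, 0.5]})` gives `A` a 75% win rate over `B` and `B` a 25% win rate over `A`.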

Theorems & Definitions (27)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Proposition 4.1
  • Definition 1.1
  • Definition 1.2
  • Theorem 1.3
  • proof
  • Definition 1.4
  • ...and 17 more