Structured Matrix Scaling for Multi-Class Calibration

Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach

TL;DR

The paper tackles miscalibration in multiclass probability estimates and argues for expressive post-hoc calibration beyond temperature scaling. By deriving a theoretically motivated quadratic softmax calibration and introducing structured regularization (SMS and SVS), it adapts model complexity to the calibration data to avoid overfitting. The authors provide an efficient open-source probmetrics implementation using SAGA optimization, accompanied by a principled hyperparameter grid search and meta-learning guidance. Extensive experiments across tabular and computer vision benchmarks show that the proposed methods consistently improve calibration, particularly as the number of classes increases, offering a practical and scalable alternative to existing calibration techniques.

Abstract

Post-hoc recalibration methods are widely used to ensure that classifiers provide faithful probability estimates. We argue that parametric recalibration functions based on logistic regression can be motivated from a simple theoretical setting for both binary and multiclass classification. This insight motivates the use of more expressive calibration methods beyond standard temperature scaling. For multi-class calibration, however, a key challenge lies in the increasing number of parameters introduced by more complex models, often coupled with limited calibration data, which can lead to overfitting. Through extensive experiments, we demonstrate that the resulting bias-variance tradeoff can be effectively managed by structured regularization, robust preprocessing, and efficient optimization. The resulting methods lead to substantial gains over existing logistic-based calibration techniques. We provide efficient and easy-to-use open-source implementations of our methods, making them an attractive alternative to common temperature, vector, and matrix scaling implementations.
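
To make the parameter-growth issue concrete, here is a minimal NumPy sketch of the three standard logit-based calibration maps named in the abstract (temperature, vector, and matrix scaling), written as special cases of one affine-then-softmax function. It is not the probmetrics API and does not implement the paper's structured regularization (SMS, SVS); all function names below are hypothetical and chosen for illustration only.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def matrix_scaling(logits, W, b):
    # Most general map: softmax(W z + b), with W of shape (k, k) and b of shape (k,).
    return softmax(logits @ W.T + b)

def vector_scaling(logits, w, b):
    # Special case: W restricted to a diagonal matrix diag(w).
    return matrix_scaling(logits, np.diag(w), b)

def temperature_scaling(logits, T):
    # Special case: W = I / T and no bias, i.e. a single scalar parameter.
    k = logits.shape[1]
    return matrix_scaling(logits, np.eye(k) / T, np.zeros(k))

def n_params(k):
    # Number of free parameters of each map for k classes (hypothetical helper).
    return {"temperature": 1, "vector": 2 * k, "matrix": k * k + k}

# Illustrative inputs only: random logits for 5 samples and 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))
probs = temperature_scaling(logits, T=2.0)
counts = n_params(10)
```

With k = 10 classes the counts are 1, 20, and 110 parameters respectively, so the matrix variant grows quadratically in k while the amount of calibration data usually does not. This is the overfitting risk that the abstract proposes to manage with regularization.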

Paper Structure

This paper contains 24 sections, 32 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Relative differences in test logloss (lower is better) after recalibration for our vector and matrix scaling functions, compared with other existing implementations. Each dot represents the average relative loss difference obtained for one tabular dataset, over 21 experiments (7 models, 3 folds). Box-plots show the 10, 25, 50, 75, and 90% quantiles. Relative differences (y-axis) are plotted using a log scale and clipped to -100% loss (min) and +100% loss (max).
  • Figure 2: Average calibration function fitting time for 1000 samples. We compute the average using all our 1372 multiclass experiments. Error bars are computed as the standard deviation of the fitting time for every experiment (normalized for 1000 samples) divided by the square root of the number of datasets times number of models. Average runtimes (y-axis) are plotted using a log scale.
  • Figure A.1: Relative differences in test logloss (lower is better) after recalibration for our linear, affine and quadratic scaling functions, compared with other existing implementations. Each dot represents the average relative loss difference obtained for one tabular dataset, over 21 experiments (7 models, 3 folds). Box-plots show the 10, 25, 50, 75, and 90% quantiles. Relative differences (y-axis) are plotted using a log scale and clipped to -100% loss (min) and +100% loss (max).
  • Figure A.2: Average calibration function fitting time for 1000 samples. We compute the average using all our 2205 binary experiments. Error bars are computed as the standard deviation of the fitting time for every experiment (normalized for 1000 samples) divided by the square root of the number of datasets times number of models. Average runtimes (y-axis) are plotted using a log scale.
  • Figure C.1: Relative differences in test Brier score (lower is better) after recalibration for our linear, affine and quadratic scaling functions, compared with other existing implementations. Each dot represents the average relative loss difference obtained for one tabular dataset, over 21 experiments (7 models, 3 folds). Box-plots show the 10, 25, 50, 75, and 90% quantiles. Relative differences (y-axis) are plotted using a log scale and clipped to -100% loss (min) and +100% loss (max).
  • ...and 5 more figures

Theorems & Definitions (6)

  • Remark
  • Remark
  • Remark
  • Remark
  • Remark
  • Remark