When is Multicalibration Post-Processing Necessary?

Dutch Hansen, Siddartha Devic, Preetum Nakkiran, Vatsal Sharan

TL;DR

This first comprehensive study evaluating the usefulness of multicalibration post-processing across a broad set of tabular, image, and language datasets for models spanning from simple decision trees to 90 million parameter fine-tuned LLMs finds that models which are calibrated out of the box tend to be relatively multicalibrated without any additional post-processing.

Abstract

Calibration is a well-studied property of predictors which guarantees meaningful uncertainty estimates. Multicalibration is a related notion -- originating in algorithmic fairness -- which requires predictors to be simultaneously calibrated over a potentially complex and overlapping collection of protected subpopulations (such as groups defined by ethnicity, race, or income). We conduct the first comprehensive study evaluating the usefulness of multicalibration post-processing across a broad set of tabular, image, and language datasets for models spanning from simple decision trees to 90 million parameter fine-tuned LLMs. Our findings can be summarized as follows: (1) models which are calibrated out of the box tend to be relatively multicalibrated without any additional post-processing; (2) multicalibration post-processing can help inherently uncalibrated models and large vision and language models; and (3) traditional calibration measures may sometimes provide multicalibration implicitly. More generally, we also distill many independent observations which may be useful for practical and effective applications of multicalibration post-processing in real-world contexts. We also release a Python package implementing multicalibration algorithms, available via `pip install multicalibration`.
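
To make the distinction between calibration and multicalibration concrete, the following is a minimal sketch of a worst-group calibration measurement. It assumes binary labels and uses a plain equal-width binned ECE rather than the smECE metric reported in the paper; the function names `binned_ece` and `max_group_ece` are hypothetical and are not taken from the released `multicalibration` package.

```python
from typing import Dict, List, Sequence


def binned_ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Equal-width binned expected calibration error for binary predictions."""
    n = len(probs)
    if n == 0:
        return 0.0
    sum_p = [0.0] * n_bins  # sum of predicted probabilities per bin
    sum_y = [0.0] * n_bins  # sum of observed labels per bin
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sum_p[b] += p
        sum_y[b] += y
    # Equivalent to sum over bins of (bin weight) * |mean prediction - mean label|.
    return sum(abs(sp - sy) for sp, sy in zip(sum_p, sum_y)) / n


def max_group_ece(
    probs: Sequence[float],
    labels: Sequence[int],
    groups: Dict[str, List[int]],
    n_bins: int = 10,
) -> float:
    """Worst binned ECE over groups given as (possibly overlapping) index lists."""
    return max(
        binned_ece([probs[i] for i in idx], [labels[i] for i in idx], n_bins)
        for idx in groups.values()
    )


# Toy data (made up for illustration): calibrated overall, miscalibrated per group.
probs = [0.5, 0.5, 0.5, 0.5]
labels = [1, 1, 0, 0]
groups = {"all": [0, 1, 2, 3], "A": [0, 1], "B": [2, 3]}

overall = binned_ece(probs, labels)
worst = max_group_ece(probs, labels, groups)
```

In this toy example the predictor has zero calibration error on the full population but an error of 0.5 on each of groups A and B, which is the kind of gap that a worst-group measurement is meant to expose.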

Paper Structure (49 sections, 61 figures)

Figures (61)

  • Figure 1: Test accuracy vs. maximum group-wise calibration error (smECE) averaged over five train/validation splits for simple neural networks (MLPs) trained on Credit Default, MEPS, and ACS Income. Each point corresponds to the performance of the multicalibration post-processing algorithm $\texttt{HKRR}$ \cite{hebert2018multicalibration} or $\texttt{HJZ}$ \cite{haghtalab2024unifying} with a different choice of hyperparameters. Standard empirical risk minimization (ERM) for MLPs achieves nearly optimal accuracy and multicalibration error. Full hyperparameter plots for each base dataset are in \ref{sec:tabular_datasets_results}.
  • Figure 2: Best performing $\texttt{HKRR}$ and $\texttt{HJZ}$ post-processing algorithm hyperparameters (selected based on validation max smECE) compared to ERM on the MEPS dataset. Calibrated models (MLP, random forest, logistic regression) need not be post-processed to achieve multicalibration. However, uncalibrated models (SVM, decision trees, naive Bayes) do benefit from multicalibration post-processing algorithms. Cells highlighted in blue show the importance of the choice of metric for selecting the best post-processing method for decision trees. Metric choice --- worst group ECE vs. worst group smECE --- can change which of ERM or $\texttt{HJZ}$ is preferable.
  • Figure 3: (Left/Middle): Hold-out calibration fraction vs. worst group calibration error (left) and accuracy (middle) for MLPs on HMDA. The left plot shows that the best max group smECE is achieved when 60% of the data is used for base model training and 40% for multicalibration post-processing. However, the middle plot shows that the best accuracy is achieved when 90% of the data is used for base model training and the remaining 10% for multicalibration post-processing. This means that a practitioner may face a tradeoff between worst-group calibration error and accuracy. The impact of calibration fraction for each dataset is available in \ref{sec:cal_frac_tabular}. (Right): Gap between measured $\textrm{smECE}$ and ECE on every group for every experiment. As sample size increases, the two metrics become very similar. However, some variability exists at lower sample sizes.
  • Figure 4: Top: Test accuracy vs. maximum group-wise calibration error (smECE) averaged over three train/validation splits for ViT and DenseNet on Camelyon17, and DistilBERT on CivilComments. In each setting there is room for multicalibration post-processing to reduce worst-group calibration error, and it does so with almost no loss in accuracy. Bottom: Impact of multicalibration post-processing algorithms for CivilComments (DistilBERT) and Amazon Polarity (ResNet-56). Multicalibration post-processing and isotonic regression both offer improvements to worst group calibration error. Full results are available in \ref{sec:all_mcb_complex_data}.
  • Figure 5: Impact of reusing all model training data for multicalibration post-processing on HMDA (Left) and CreditDefault (Right) as measured by worst group calibration error (max $\textrm{smECE}$). Results vary; for HMDA, post-processing with reused data performs essentially as well as post-processing with held-out data for all models except random forest post-processed with $\texttt{HKRR}$. However, on CreditDefault, we find that data reuse can harm post-processing across the board. Plots for each dataset are available in \ref{sec:reuse-data-full-plots}.
  • ...and 56 more figures
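
The hold-out calibration fraction discussed in the Figure 3 caption can be illustrated with a small sketch. It assumes a simple random split of sample indices into a base-model training part and a post-processing part; the function name `split_for_postprocessing` and its parameters are hypothetical and do not describe the paper's actual experimental pipeline.

```python
import random
from typing import List, Tuple


def split_for_postprocessing(
    n_samples: int, calib_fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split indices into (base-model training, held-out post-processing)."""
    if not 0.0 <= calib_fraction < 1.0:
        raise ValueError("calib_fraction must be in [0, 1)")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_calib = int(round(n_samples * calib_fraction))
    # The first n_calib shuffled indices are held out for post-processing.
    return indices[n_calib:], indices[:n_calib]


# A calibration fraction of 0.4 corresponds to a 60% / 40% split.
train_idx, calib_idx = split_for_postprocessing(n_samples=10, calib_fraction=0.4)
```

Sweeping `calib_fraction` over a grid and recording both accuracy and worst-group calibration error for each value is one way to examine the tradeoff the caption describes.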

Theorems & Definitions (1)

  • Remark 1