When is Multicalibration Post-Processing Necessary?

Dutch Hansen; Siddartha Devic; Preetum Nakkiran; Vatsal Sharan

TL;DR

This is the first comprehensive study of the usefulness of multicalibration post-processing, covering tabular, image, and language datasets and models ranging from simple decision trees to 90-million-parameter fine-tuned LLMs. Its main finding is that models which are calibrated out of the box tend to be relatively multicalibrated without any additional post-processing.

Abstract

Calibration is a well-studied property of predictors which guarantees meaningful uncertainty estimates. Multicalibration is a related notion -- originating in algorithmic fairness -- which requires predictors to be simultaneously calibrated over a potentially complex and overlapping collection of protected subpopulations (such as groups defined by ethnicity, race, or income). We conduct the first comprehensive study evaluating the usefulness of multicalibration post-processing across a broad set of tabular, image, and language datasets for models spanning from simple decision trees to 90-million-parameter fine-tuned LLMs. Our findings can be summarized as follows: (1) models which are calibrated out of the box tend to be relatively multicalibrated without any additional post-processing; (2) multicalibration post-processing can help inherently uncalibrated models and large vision and language models; and (3) traditional calibration measures may sometimes provide multicalibration implicitly. More generally, we distill many independent observations which may be useful for practical and effective applications of multicalibration post-processing in real-world contexts. We also release a Python package implementing multicalibration algorithms, available via `pip install multicalibration`.
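
To make the distinction between calibration and multicalibration concrete, the following is a minimal sketch of a worst-group calibration measurement, assuming binary labels and groups given as boolean masks. It uses plain binned ECE rather than the smECE metric reported in the figures, and the function names `binned_ece` and `max_group_ece` are illustrative; they are not taken from the released package.

```python
import numpy as np


def binned_ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width bins on [0, 1]."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if probs.size == 0:
        return 0.0
    # Map each prediction to a bin; a prediction of exactly 1.0 goes to the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed positive rate and mean prediction in this bin,
            # weighted by the fraction of samples that fall in the bin.
            gap = abs(labels[mask].mean() - probs[mask].mean())
            ece += mask.mean() * gap
    return ece


def max_group_ece(probs, labels, groups, n_bins=10):
    """Worst calibration error over a dict of (possibly overlapping) group masks."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return max(binned_ece(probs[m], labels[m], n_bins) for m in groups.values())


# Toy data (hypothetical): a constant predictor that is calibrated overall
# but miscalibrated on each of two subgroups.
probs = np.full(8, 0.5)
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
groups = {
    "group_a": np.arange(8) < 4,   # all positives
    "group_b": np.arange(8) >= 4,  # all negatives
}
overall = binned_ece(probs, labels)
worst = max_group_ece(probs, labels, groups)
```

On this toy data the overall error is zero while the worst-group error is 0.5, which is the kind of gap that multicalibration post-processing is meant to close.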

Paper Structure (49 sections, 61 figures)

Introduction
Our Contributions
Related Works: Theory and Practice
Preliminaries
Multicalibration Post-Processing Algorithms and Hyperparameter Selection
Subgroup Selection and Experimental Methodology
Data Partitioning.
Experiments on Tabular Datasets
Experiments on Language and Vision Datasets
Simplifying Data Partitioning with Data Reuse
Takeaways for Practitioners and Discussion
Subgroup Design Considerations
Experimental Limitations and Conclusion
Acknowledgements.
Additional Related Work
...and 34 more sections

Figures (61)

Figure 1: Test accuracy vs. maximum group-wise calibration error (smECE) averaged over five train/validation splits for simple neural networks (MLPs) trained on Credit Default, MEPS, and ACS Income. Each point corresponds to the performance of the multicalibration post-processing algorithm $\texttt{HKRR}$ (Hébert-Johnson et al., 2018) or $\texttt{HJZ}$ (Haghtalab et al., 2024) with a different choice of hyperparameters. Standard empirical risk minimization (ERM) for MLPs achieves nearly optimal accuracy and multicalibration error. Full hyperparameter plots for each base dataset are in \ref{sec:tabular_datasets_results}.
Figure 2: Best performing $\texttt{HKRR}$ and $\texttt{HJZ}$ post-processing algorithm hyperparameters (selected based on validation max smECE) compared to ERM on the MEPS dataset. Calibrated models (MLP, random forest, logistic regression) need not be post-processed to achieve multicalibration. However, uncalibrated models (SVM, decision trees, naive Bayes) do benefit from multicalibration post-processing algorithms. Cells highlighted in blue show the importance of the choice of metric for selecting the best post-processing method for decision trees. Metric choice --- worst group ECE vs. worst group smECE --- can change which of ERM or $\texttt{HJZ}$ is preferable.
Figure 3: (Left/Middle): Hold-out calibration fraction vs. worst group calibration error (left) and accuracy (middle) for MLPs on HMDA. The left plot shows that the best max group smECE is achieved when 60% of the data is used for base model training and 40% for multicalibration post-processing. However, the middle plot shows that the best accuracy is achieved when 90% of the data is used for base model training and the remaining 10% for multicalibration post-processing. This means that a practitioner may face a tradeoff between worst-group calibration error and accuracy. The impact of calibration fraction for each dataset is available in \ref{sec:cal_frac_tabular}. (Right): Gap between measured $\textrm{smECE}$ and ECE on every group for every experiment. As sample size increases, the two metrics become very similar. However, some variability exists at lower sample sizes.
Figure 4: Top: Test accuracy vs. maximum group-wise calibration error (smECE) averaged over three train/validation splits for ViT and DenseNet on Camelyon17, and DistilBERT on Civil Comments. In each setting there is room to improve worst group calibration error, and multicalibration post-processing achieves this with almost no loss in accuracy. Bottom: Impact of multicalibration post-processing algorithms for Civil Comments (DistilBERT) and Amazon Polarity (ResNet-56). Multicalibration post-processing and isotonic regression both offer improvements to worst group calibration error. Full results are available in \ref{sec:all_mcb_complex_data}.
Figure 5: Impact of reusing all model training data for multicalibration post-processing on HMDA (Left) and Credit Default (Right) as measured by worst group calibration error (max $\textrm{smECE}$). Results vary; for HMDA, post-processing with reused data performs essentially as well as post-processing with held-out data for all models except random forest post-processed with $\texttt{HKRR}$. However, on Credit Default, we find that data reuse can harm post-processing across the board. Plots for each dataset are available in \ref{sec:reuse-data-full-plots}.
...and 56 more figures

Theorems & Definitions (1)

Remark 1
