Mitigating the Multiplicity Burden: The Role of Calibration in Reducing Predictive Multiplicity of Classifiers

Mustafa Cavus

Abstract

As machine learning models are increasingly deployed in high-stakes environments, ensuring both probabilistic reliability and prediction stability has become critical. This paper examines the interplay between classification calibration and predictive multiplicity, the phenomenon in which multiple near-optimal models within the Rashomon set yield conflicting credit outcomes for the same applicant. Using nine diverse credit risk benchmark datasets, we investigate whether predictive multiplicity concentrates in regions of low predictive confidence and how post-hoc calibration can mitigate algorithmic arbitrariness. Our empirical analysis reveals that minority class observations bear a disproportionate multiplicity burden, as confirmed by significant disparities in predictive multiplicity and prediction confidence. Furthermore, our empirical comparisons indicate that applying post-hoc calibration methods (specifically Platt Scaling, Isotonic Regression, and Temperature Scaling) is associated with lower obscurity across the Rashomon set. Among the tested techniques, Platt Scaling and Isotonic Regression provide the most robust reduction in predictive multiplicity. These findings suggest that calibration can function as a consensus-enforcing layer and may support procedural fairness by mitigating predictive multiplicity.
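
To make the notion of conflicting outcomes concrete, the following is a minimal sketch of how per-applicant disagreement and confidence could be computed from the predicted probabilities of several near-optimal models. The function names, the 0.5 decision threshold, and the example scores are all hypothetical; the disagreement measure is an illustrative proxy and is not the paper's definition of ambiguity, discrepancy, or obscurity.

```python
from typing import List


def disagreement_rate(probs: List[float], threshold: float = 0.5) -> float:
    """Share of models whose thresholded label differs from the majority label.

    `probs` holds one predicted probability per model for a single applicant.
    Illustrative proxy only (assumption), not a metric taken from the paper.
    """
    labels = [1 if p >= threshold else 0 for p in probs]
    positives = sum(labels)
    # Ties are resolved in favour of the positive label (arbitrary choice).
    majority = 1 if positives * 2 >= len(labels) else 0
    return sum(1 for y in labels if y != majority) / len(labels)


def mean_confidence(probs: List[float]) -> float:
    """Average of max(p, 1 - p) over the models, as a simple confidence score."""
    return sum(max(p, 1.0 - p) for p in probs) / len(probs)


# Hypothetical scores from four near-optimal models for two applicants.
confident_applicant = [0.91, 0.88, 0.93, 0.90]
borderline_applicant = [0.52, 0.47, 0.55, 0.49]

for name, scores in [("confident", confident_applicant),
                     ("borderline", borderline_applicant)]:
    print(name, disagreement_rate(scores), round(mean_confidence(scores), 4))
```

In this toy setting, the applicant whose scores sit near the threshold receives conflicting labels from the models, while the applicant with high-confidence scores does not. This is the pattern the abstract describes when it asks whether multiplicity concentrates in regions of low predictive confidence.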

Paper Structure (19 sections, 8 equations, 3 figures, 2 tables)

Introduction
Related Work
Methodology
Rashomon Effect and Predictive Multiplicity
Ambiguity
Discrepancy
Obscurity
Post-hoc Calibration Methods
Platt Scaling
Isotonic Regression
Temperature Scaling
Experiments
Setup
Data
Results
...and 4 more sections

Figures (3)

Figure 1: Workflow of the experimental methodology.
Figure 2: Relationship between Prediction Confidence and Obscurity across nine credit scoring datasets for majority-minority class distributions.
Figure 3: The impact of various post-hoc calibration methods on obscurity and confidence scores. The bar plots represent the grand mean across nine datasets, while the overlaid strip charts show individual dataset outcomes. Error bars indicate the standard error of the mean.
