The Role of Hyperparameters in Predictive Multiplicity
Mustafa Cavus, Katarzyna Woźnica, Przemysław Biecek
TL;DR
The paper investigates how hyperparameter tuning drives predictive multiplicity in six models for tabular data on 21 binary classification benchmarks, highlighting the risk of inconsistent predictions in high-stakes settings. It formalizes the problem with discrepancy and tunability measures, including $\delta^{D_p}_{\bm{\bar{\theta}}}(f)$, $\delta_{\bm{\bar{\theta}}^{[h]}}(f)$, $\delta_{\bm{\bar{\theta}}^{[h_1,h_2]}}(f)$, and $d(f_{\theta^{(h)}})$, and evaluates them over roughly $5\times 10^5$ hyperparameter configurations, with F1-score as the performance metric. Hyperparameters such as $\lambda$ in Elastic Net, $\gamma$ in SVM, $cp$ and $maxdepth$ in Decision Trees, and $\alpha$ in XGBoost can drive substantial prediction variability; XGBoost is notably tunable but also exhibits the highest discrepancy. The results point to a trade-off between performance gains and prediction consistency, and suggest using explainable analyses of hyperparameter subspaces to navigate multiplicity, in support of fairer and more transparent decision-making in practical deployments.
Abstract
This paper investigates the critical role of hyperparameters in predictive multiplicity, where different machine learning models trained on the same dataset yield divergent predictions for identical inputs. These inconsistencies can seriously impact high-stakes decisions such as credit assessments, hiring, and medical diagnoses. Focusing on six widely used models for tabular data (Elastic Net, Decision Tree, k-Nearest Neighbor, Support Vector Machine, Random Forest, and Extreme Gradient Boosting), we explore how hyperparameter tuning influences predictive multiplicity, as expressed by the distribution of prediction discrepancies across benchmark datasets. Key hyperparameters such as lambda in Elastic Net, gamma in Support Vector Machines, and alpha in Extreme Gradient Boosting play a crucial role in shaping predictive multiplicity, often compromising the stability of predictions within specific algorithms. Our experiments on 21 benchmark datasets reveal that tuning these hyperparameters leads to notable performance improvements but also increases prediction discrepancies, with Extreme Gradient Boosting exhibiting the highest discrepancy and substantial prediction instability. This highlights the trade-off between performance optimization and prediction consistency, raising concerns about the risk of arbitrary predictions. These findings provide insight into how hyperparameter optimization leads to predictive multiplicity. While predictive multiplicity allows prioritizing domain-specific objectives such as fairness and reduces reliance on a single model, it also complicates decision-making, potentially leading to arbitrary or unjustified outcomes.
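
To make the notion of prediction discrepancy concrete, the following is a minimal sketch in Python. It assumes the common definition of discrepancy as the largest fraction of inputs on which any competing model disagrees with a baseline model; the paper's own measures are not reproduced here and may differ in detail. The function names, the hyperparameter labels, and the toy predictions are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence, Tuple


def conflict_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of inputs on which two prediction vectors disagree."""
    if len(a) != len(b):
        raise ValueError("prediction vectors must have the same length")
    if len(a) == 0:
        raise ValueError("prediction vectors must not be empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def discrepancy(
    baseline: Sequence[int], competitors: Dict[str, Sequence[int]]
) -> Tuple[float, str]:
    """Largest conflict rate against the baseline, and the configuration causing it.

    Assumed definition: the maximum, over competing models, of the share of
    inputs whose predicted label differs from the baseline model's label.
    """
    rates = {name: conflict_rate(baseline, preds) for name, preds in competitors.items()}
    worst = max(rates, key=rates.get)
    return rates[worst], worst


# Toy labels for eight inputs (made-up values, not results from the paper).
baseline = [1, 0, 1, 1, 0, 0, 1, 0]
competitors = {
    "lambda=0.01": [1, 0, 1, 1, 0, 0, 1, 0],  # agrees everywhere
    "lambda=0.1": [1, 0, 1, 0, 0, 0, 1, 0],   # one conflicting prediction
    "lambda=1.0": [0, 0, 1, 0, 0, 1, 1, 0],   # three conflicting predictions
}

value, config = discrepancy(baseline, competitors)
print(f"discrepancy = {value:.3f} (worst configuration: {config})")
```

In this toy setting, a single hyperparameter value changes the predicted label for three of eight inputs, which is the kind of instability the abstract describes. In practice the competing models would typically be restricted to configurations whose performance is close to that of the baseline.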
