The Role of Hyperparameters in Predictive Multiplicity

Mustafa Cavus, Katarzyna Woźnica, Przemysław Biecek

TL;DR

The paper investigates how hyperparameter tuning drives predictive multiplicity across six models for tabular data on 21 binary classification benchmarks, highlighting the risk of inconsistent predictions in high-stakes settings. It formalizes this with discrepancy and tunability measures, including $\delta^{D_p}_{\bm{\bar{\theta}}}(f)$, $\delta_{\bm{\bar{\theta}}^{[h]}}(f)$, $\delta_{\bm{\bar{\theta}}^{[h_1,h_2]}}(f)$, and $d(f_{\theta^{(h)}})$, and evaluates them over roughly $5\times 10^5$ hyperparameter configurations using F1-score. The findings show that hyperparameters such as $\lambda$ in Elastic Net, $\gamma$ in SVM, $cp$ and $maxdepth$ in Decision Trees, and $\alpha$ in XGBoost can drive substantial prediction variability, with XGBoost exhibiting the highest discrepancy despite notable tunability. The results emphasize a fundamental trade-off between performance gains and prediction consistency and suggest leveraging explainable subspace analyses to navigate multiplicity, ultimately supporting fairer and more transparent decision-making in practical deployments.

Abstract

This paper investigates the critical role of hyperparameters in predictive multiplicity, where different machine learning models trained on the same dataset yield divergent predictions for identical inputs. These inconsistencies can seriously impact high-stakes decisions such as credit assessments, hiring, and medical diagnoses. Focusing on six widely used models for tabular data - Elastic Net, Decision Tree, k-Nearest Neighbor, Support Vector Machine, Random Forests, and Extreme Gradient Boosting - we explore how hyperparameter tuning influences predictive multiplicity, as expressed by the distribution of prediction discrepancies across benchmark datasets. Key hyperparameters such as lambda in Elastic Net, gamma in Support Vector Machines, and alpha in Extreme Gradient Boosting play a crucial role in shaping predictive multiplicity, often compromising the stability of predictions within specific algorithms. Our experiments on 21 benchmark datasets reveal that tuning these hyperparameters leads to notable performance improvements but also increases prediction discrepancies, with Extreme Gradient Boosting exhibiting the highest discrepancy and substantial prediction instability. This highlights the trade-off between performance optimization and prediction consistency, raising concerns about the risk of arbitrary predictions. These findings provide insight into how hyperparameter optimization leads to predictive multiplicity. While predictive multiplicity allows prioritizing domain-specific objectives such as fairness and reduces reliance on a single model, it also complicates decision-making, potentially leading to arbitrary or unjustified outcomes.
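The notion of prediction discrepancy used above can be illustrated with a minimal sketch. It assumes that the difference between two models is the fraction of examples on which their predicted labels disagree, and that a model's discrepancy is the maximum of this rate between a default configuration and a set of tuned configurations. The function names and the prediction vectors are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def disagreement_rate(baseline: Sequence[int], other: Sequence[int]) -> float:
    """Fraction of examples on which two models predict different labels."""
    if len(baseline) != len(other):
        raise ValueError("prediction vectors must have the same length")
    if len(baseline) == 0:
        raise ValueError("prediction vectors must be non-empty")
    return sum(b != o for b, o in zip(baseline, other)) / len(baseline)


def discrepancy(baseline: Sequence[int], candidates: Sequence[Sequence[int]]) -> float:
    """Maximum disagreement rate between the baseline and any candidate model."""
    return max(disagreement_rate(baseline, c) for c in candidates)


# Hypothetical predicted labels on 8 test examples (illustrative only).
default_preds = [0, 1, 1, 0, 1, 0, 0, 1]
tuned_preds = [
    [0, 1, 1, 0, 1, 0, 0, 1],  # identical to the default model
    [0, 1, 0, 0, 1, 0, 1, 1],  # disagrees on 2 of 8 examples
    [1, 1, 0, 0, 1, 1, 0, 1],  # disagrees on 3 of 8 examples
]

model_discrepancy = discrepancy(default_preds, tuned_preds)
print(model_discrepancy)  # 0.375
```

In this toy case the third tuned configuration flips three of the eight predictions, so the discrepancy is 3/8 even if all configurations had similar F1-scores.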

Paper Structure

This paper contains 19 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Distribution of the predictive multiplicity of the models, measured as discrepancy relative to the default hyperparameters. The discrepancy per model is calculated on each dataset as the maximum difference between a model trained with the default hyperparameters and models trained with tuned hyperparameters, as in Equation \ref{eq:mult_of_a_model}.
  • Figure 2: Distribution of the discrepancy of the hyperparameters of the models. The discrepancy per hyperparameter is calculated on each dataset as the maximum difference across models trained with every considered value of the hyperparameter of interest while the other hyperparameters are fixed to their default values, as in Equation \ref{eq:mult_of_a_hyp}.
  • Figure 3: Relationship between predictive multiplicity (measured as discrepancy) and model performance (measured as F1). Each panel shows results for a different dataset. The colors denote the models under consideration. Each dot corresponds to a different set of hyperparameters. Only datasets in which discrepancy varies among models with the same performance are shown.
  • Figure 4: Joint distribution of discrepancy and model tunability for hyperparameter pairs. The discrepancy for each hyperparameter pair is calculated as the mean difference between a model trained with default settings for the other hyperparameters and one trained with all considered values of the hyperparameter pair, as described in Equation \ref{eq:mult_of_joint}. A bivariate classification scales the mean F1-score and mean discrepancy into a three-by-three grid using the 'equal' style, which divides both variables into three equal-range categories. The resulting bivariate heatmaps visualize the trade-off between performance and prediction discrepancy.
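The three-by-three bivariate classification described for Figure 4 can be sketched as follows. This is a minimal illustration that assumes "equal-range categories" means equal-width bins between the observed minimum and maximum of each variable. The function names and the input values are hypothetical, and the authors' actual tooling may differ.

```python
from typing import List, Sequence, Tuple


def equal_interval_bin(value: float, lo: float, hi: float, n_bins: int = 3) -> int:
    """Return a bin index in 0..n_bins-1 using equal-width bins over [lo, hi]."""
    if hi == lo:
        return 0  # degenerate range: everything falls in the first bin
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value belongs to the last bin


def bivariate_classes(
    f1_means: Sequence[float], disc_means: Sequence[float], n_bins: int = 3
) -> List[Tuple[int, int]]:
    """Assign each (mean F1, mean discrepancy) pair to a cell of an n x n grid."""
    f1_lo, f1_hi = min(f1_means), max(f1_means)
    d_lo, d_hi = min(disc_means), max(disc_means)
    return [
        (equal_interval_bin(f, f1_lo, f1_hi, n_bins),
         equal_interval_bin(d, d_lo, d_hi, n_bins))
        for f, d in zip(f1_means, disc_means)
    ]


# Hypothetical mean F1 and mean discrepancy for four hyperparameter pairs.
f1_means = [0.60, 0.65, 0.75, 0.90]
disc_means = [0.00, 0.15, 0.05, 0.30]

classes = bivariate_classes(f1_means, disc_means)
print(classes)  # [(0, 0), (0, 1), (1, 0), (2, 2)]
```

Each tuple is a (performance bin, discrepancy bin) cell, so (2, 2) marks a hyperparameter pair with both high mean F1 and high mean discrepancy, which is the corner of the grid where the performance-consistency trade-off is sharpest.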