FairGridSearch: A Framework to Compare Fairness-Enhancing Models
Shih-Chi Ma, Tatiana Ermakova, Benjamin Fabian
TL;DR
FairGridSearch addresses the challenge of selecting fairness-enhancing models for binary classification. It provides a grid-search-like framework that compares combinations of bias-mitigation methods, base estimators, classification thresholds, and evaluation metrics. The framework incorporates cross-validation and picks the best model with a cost-based criterion, $C = \alpha \cdot (1 - \text{metric}_{acc}) + \beta \cdot |\text{metric}_{fair}|$, where $\alpha$ and $\beta$ weight the accuracy and fairness terms and are both set to 1 in the experiments, so that the two are weighted equally. Experiments on the Adult, COMPAS, and German Credit datasets show that metric choice, base estimator, and threshold significantly influence fairness outcomes, and that no single approach is best across all datasets. The work highlights the need to consider a broad set of factors beyond bias mitigation alone and provides a practical tool for systematic fair-model selection.
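The cost criterion above can be illustrated with a minimal Python sketch. The function name `selection_cost`, the candidate names, and the metric values are hypothetical and chosen only for illustration; they are not results from the paper. The sketch assumes the accuracy metric lies in [0, 1] and the fairness metric is a signed difference whose ideal value is 0.

```python
def selection_cost(acc_metric, fair_metric, alpha=1.0, beta=1.0):
    """Cost C = alpha * (1 - metric_acc) + beta * |metric_fair|; lower is better."""
    return alpha * (1.0 - acc_metric) + beta * abs(fair_metric)


# Hypothetical (accuracy, fairness) pairs for three candidate models.
# These numbers are made up for illustration only.
candidates = {
    "model_a": (0.85, -0.12),
    "model_b": (0.82, 0.04),
    "model_c": (0.88, 0.20),
}

# Score every candidate with alpha = beta = 1, as in the paper's experiments.
costs = {name: selection_cost(acc, fair) for name, (acc, fair) in candidates.items()}

# The candidate with the lowest cost is selected.
best = min(costs, key=costs.get)
print(best, round(costs[best], 3))
```

In this toy example the most accurate candidate (`model_c`) is not selected, because its larger fairness gap outweighs its accuracy advantage when both terms are weighted equally.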
Abstract
Machine learning models are increasingly used in critical decision-making applications. However, these models are susceptible to replicating or even amplifying bias present in real-world data. While the literature offers various bias mitigation methods and base estimators, selecting the optimal model for a specific application remains challenging. This paper focuses on binary classification and proposes FairGridSearch, a novel framework for comparing fairness-enhancing models. FairGridSearch enables experimentation with different model parameter combinations and recommends the best one. The study applies FairGridSearch to three popular datasets (Adult, COMPAS, and German Credit) and analyzes the impacts of metric selection, base estimator choice, and classification threshold on model fairness. The results highlight the significance of selecting appropriate accuracy and fairness metrics for model evaluation. Additionally, the choice of base estimator affects the effectiveness of bias mitigation methods, and the classification threshold affects fairness stability, but neither effect is consistent across all datasets. Based on these findings, future research on fairness in machine learning should consider a broader range of factors when building fair models, going beyond bias mitigation methods alone.
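To make the idea of comparing parameter combinations concrete, the following is a simplified Python sketch of a grid over base estimators and classification thresholds. It is not the FairGridSearch implementation: it omits bias mitigation and cross-validation, the data and estimator names are invented, and statistical parity difference is used as the fairness metric purely as an assumption for illustration. The cost is computed as $(1 - \text{accuracy}) + |\text{fairness}|$, i.e. with both weights equal to 1.

```python
from itertools import product


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def positive_rate(y_pred, group, value):
    """Share of positive predictions among members of one group."""
    members = [p for p, g in zip(y_pred, group) if g == value]
    return sum(members) / len(members)


def statistical_parity_difference(y_pred, group):
    """Positive rate of group 0 minus that of group 1 (0 means parity).

    Treating group 1 as the privileged group is an assumption of this sketch.
    """
    return positive_rate(y_pred, group, 0) - positive_rate(y_pred, group, 1)


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical predicted scores from two base estimators.
scores = {
    "estimator_a": [0.9, 0.4, 0.8, 0.6, 0.3, 0.2, 0.45, 0.1],
    "estimator_b": [0.7, 0.6, 0.9, 0.55, 0.35, 0.6, 0.65, 0.2],
}
thresholds = [0.3, 0.5, 0.7]

results = {}
for name, threshold in product(scores, thresholds):
    # Turn scores into hard labels at the current threshold.
    y_pred = [int(s >= threshold) for s in scores[name]]
    acc = accuracy(y_true, y_pred)
    fair = statistical_parity_difference(y_pred, group)
    # Cost with both weights set to 1: (1 - accuracy) + |fairness|.
    results[(name, threshold)] = (1.0 - acc) + abs(fair)

# The combination with the lowest cost is recommended.
best_combo = min(results, key=results.get)
print(best_combo, results[best_combo])
```

Even on this toy data, the same estimator receives different costs at different thresholds, which mirrors the observation that the classification threshold influences fairness outcomes.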
