
Fair Generalized Linear Mixed Models

Jan Pablo Burgard, João Vitor Pamplona

TL;DR

The paper addresses fairness in predictive modeling when data are collected via stratified sampling, which induces dependence across observations and can bias outcomes. It introduces Fair Generalized Linear Mixed Models (Fair GLMM) with two solution pathways: a constrained optimization approach analogous to Cluster-Regularized Logistic Regression (Fair CRLR) and a penalized, boosting-like unconstrained variant based on a Lagrange multiplier to handle both fairness and random effects. Through extensive simulations across unfair/fair populations with/without strata effects and an application to Bank marketing data, the authors demonstrate that Fair GLMM and Fair CRLR improve disparate impact (DI) while maintaining or enhancing accuracy compared with standard GLMM, CRLR, and fair LR approaches. A KKT-based sensitivity analysis identifies the most influential sensitive features (notably Housing) on DI, guiding constraint design. The work provides practical guidance on selecting between Fair GLMM and CRLR and extends fairness considerations to hierarchical data contexts, with potential impact on fair decision-making in survey-based and real-world settings.
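Since disparate impact (DI) is the fairness measure named above, a small sketch may help fix ideas. It uses one common definition, the ratio of positive-prediction rates between two groups defined by a binary sensitive feature; the paper's exact DI formula is not given in this summary and may differ. The function names `positive_rate` and `disparate_impact` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def positive_rate(preds: Sequence[int], groups: Sequence[int], value: int) -> float:
    """Share of positive predictions among observations whose group equals `value`."""
    selected = [p for p, g in zip(preds, groups) if g == value]
    if not selected:
        raise ValueError(f"no observations with group == {value}")
    return sum(selected) / len(selected)


def disparate_impact(preds: Sequence[int], groups: Sequence[int]) -> float:
    """Ratio of the smaller to the larger positive rate (assumed definition).

    A value of 1.0 means both groups receive positive predictions at the same
    rate; values near 0 indicate a strong imbalance.
    """
    rate_0 = positive_rate(preds, groups, 0)
    rate_1 = positive_rate(preds, groups, 1)
    larger = max(rate_0, rate_1)
    if larger == 0:
        return 1.0  # no positive predictions in either group
    return min(rate_0, rate_1) / larger


# Toy data: group 0 gets 2 of 4 positives, group 1 gets 1 of 4.
preds = [1, 1, 0, 0, 1, 0, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(disparate_impact(preds, groups))
```

Under this ratio convention, a fairness-aware model would aim to move the value toward 1 without giving up much accuracy.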

Abstract

When using machine learning for automated prediction, it is important to account for fairness in the predictions. Fairness in machine learning aims to ensure that biases in the data and model inaccuracies do not lead to discriminatory decisions. For example, predictions from fair machine learning models should not discriminate on the basis of sensitive variables such as sexual orientation and ethnicity. The training data are often obtained from social surveys. In social surveys, the data are often collected via stratified sampling, e.g., due to cost restrictions. In stratified samples, the assumption of independence between observations is not fulfilled. Hence, if the machine learning models do not account for the correlation induced by the strata, the results may be biased. The bias is especially high when the strata assignment is correlated with the variable of interest. We present in this paper an algorithm that can handle both problems simultaneously, and we demonstrate the impact of stratified sampling on the quality of fair machine learning predictions in a reproducible simulation study.
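The dependence that stratified sampling introduces can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's simulation design: it assumes a logistic model with a single covariate and a Gaussian random intercept shared by all observations in a stratum. The function `simulate_stratum_rates` and the parameters `sigma_b` and `beta` are hypothetical.

```python
import math
import random
import statistics


def simulate_stratum_rates(n_strata, n_per_stratum, sigma_b, beta=1.0, seed=0):
    """Return the share of positive outcomes in each simulated stratum.

    Assumed model: P(y = 1) = sigmoid(beta * x + b_stratum), where
    b_stratum ~ N(0, sigma_b^2) is shared by every observation in a stratum.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_strata):
        b = rng.gauss(0.0, sigma_b)  # random intercept of this stratum
        positives = 0
        for _ in range(n_per_stratum):
            x = rng.gauss(0.0, 1.0)
            p = 1.0 / (1.0 + math.exp(-(beta * x + b)))
            positives += rng.random() < p
        rates.append(positives / n_per_stratum)
    return rates


# Without a strata effect, stratum-level rates differ only by sampling noise.
spread_independent = statistics.pvariance(simulate_stratum_rates(50, 200, sigma_b=0.0))
# With a strata effect, outcomes within a stratum move together.
spread_strata = statistics.pvariance(simulate_stratum_rates(50, 200, sigma_b=2.0))
print(spread_independent, spread_strata)
```

The much larger spread of stratum-level rates in the second case reflects the shared intercept: observations from the same stratum are no longer independent, which is the situation a model without random effects ignores.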

Paper Structure (15 sections, 59 equations, 12 figures, 11 tables)