f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization

Sina Baharlouei; Shivam Patel; Meisam Razaviyayn

TL;DR

This work addresses the challenge of scalable, fair empirical risk minimization in settings with potential distribution shifts. It introduces f-FERM, a unified stochastic framework that regularizes the ERM objective with $f$-divergences, exploiting Legendre-Fenchel duality to obtain unbiased mini-batch gradients and a convergent stochastic gradient descent-ascent (SGDA) algorithm with a provable iteration complexity. The authors extend f-FERM to distribution shifts via distributionally robust optimization, deriving two regimes: small shifts with $\ell_p$ uncertainty sets and large shifts with $\ell_{\infty}$ uncertainty sets, each leading to tractable optimization procedures and memory-efficient implementations. Empirical results on standard fairness benchmarks, with fairness measured by, e.g., demographic parity (DP) violation, demonstrate competitive fairness-accuracy tradeoffs across batch sizes and robust performance under distribution shifts, highlighting practical utility for deploying fair ML systems without reliance on causal graphs.
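To make the idea of an $f$-divergence fairness regularizer concrete, here is a minimal sketch. It assumes (this is not stated in the summary above) that the DP violation is measured as the $f$-divergence between the empirical joint distribution of discrete predictions and the sensitive attribute and the product of their marginals. The function names `f_kl`, `f_chi2`, and `f_divergence_dp` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np

def f_kl(t):
    # f(t) = t log t (KL divergence), with f(0) = 0 by continuity.
    t = np.asarray(t, dtype=float)
    safe_t = np.where(t > 0, t, 1.0)
    return np.where(t > 0, t * np.log(safe_t), 0.0)

def f_chi2(t):
    # f(t) = (t - 1)^2 (chi-square divergence).
    t = np.asarray(t, dtype=float)
    return (t - 1.0) ** 2

def f_divergence_dp(y_hat, s, f, n_classes, n_groups):
    """Plug-in estimate of D_f(P(y_hat, s) || P(y_hat) x P(s)).

    Assumption: y_hat and s are integer arrays of discrete labels.
    """
    joint = np.zeros((n_classes, n_groups))
    np.add.at(joint, (y_hat, s), 1.0)       # empirical joint counts
    joint /= joint.sum()
    p_y = joint.sum(axis=1, keepdims=True)  # marginal of predictions
    p_s = joint.sum(axis=0, keepdims=True)  # marginal of sensitive attribute
    prod = p_y * p_s
    mask = prod > 0                         # cells with zero mass contribute nothing
    ratio = np.where(mask, joint / np.where(mask, prod, 1.0), 0.0)
    return float(np.sum(np.where(mask, prod * f(ratio), 0.0)))

# Toy data: predictions independent of the group vs. identical to the group.
s = np.array([0, 1, 0, 1])
y_indep = np.array([0, 0, 1, 1])
y_dep = s.copy()

kl_indep = f_divergence_dp(y_indep, s, f_kl, 2, 2)
kl_dep = f_divergence_dp(y_dep, s, f_kl, 2, 2)
chi2_dep = f_divergence_dp(y_dep, s, f_chi2, 2, 2)
print(kl_indep, kl_dep, chi2_dep)
```

The divergence vanishes when predictions are independent of the sensitive attribute and grows as they become dependent. Note that this plug-in estimate is a nonlinear function of empirical probabilities, so evaluating it on a small mini-batch generally does not give an unbiased estimate of the full-data value; this is the difficulty the duality-based reformulation is meant to avoid.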

Abstract

Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of constraints and regularizers. Here, the term "stochastic" refers to the ability of the algorithm to work with small mini-batches of data. Motivated by this limitation of the existing literature, this paper presents a unified stochastic optimization framework for fair empirical risk minimization based on $f$-divergence measures ($f$-FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of the fairness-accuracy tradeoffs offered by $f$-FERM for almost all batch sizes (ranging from full-batch to a batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from the training to the test data. Our extension is based on a distributionally robust optimization reformulation of the $f$-FERM objective under $L_p$ norms as uncertainty sets. Again, in this distributionally robust setting, $f$-FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature in tasks involving distribution shifts. An efficient stochastic implementation of $f$-FERM is publicly available.

Paper Structure (21 sections, 19 theorems, 69 equations, 5 figures, 1 table, 3 algorithms)

Introduction
Fair Empirical Risk Minimization via $f$-divergences
A Convergent Stochastic Algorithm for fair ERM via $f$-Divergences
Robust $f$-FERM in the Presence of Distribution Shifts
Robust $f$-FERM Under $\ell_p$ Norms and Small Distribution Shifts
Robust $f$-FERM Under $\ell_{\infty}$ Norms and Potentially Large Distribution Shifts
Experiments
Fairness-Accuracy Tradeoffs on Benchmark Datasets
Fairness-Accuracy Tradeoffs in the Presence of the Distribution Shift
Conclusion
$f$-FERM for other notions of group fairness
$f$-divergences for continuous sensitive attributes and target variables
$f$-divergences cover well-known notions of fairness violation
Proof of Proposition \ref{thm: variational}
Derivation of Closed-Form Expressions for Unbiased Gradient Estimators of $f$-Divergences
...and 6 more sections

Key Result

Proposition 2.1

Let $f(\cdot)$ be a convex function. Then the f-FERM problem (eq: f-FERM) can be reformulated in terms of $f^{*}$, where $f^{*}(z) = \sup_{w \in \textup{dom}(f)} \left( w^T z - f(w) \right)$ is the Legendre-Fenchel transformation (convex conjugate) of the function $f$.
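The reformulated objective itself is not reproduced in this summary, so the following is only a sketch of the standard identity that Legendre-Fenchel duality provides for a single term of an $f$-divergence. For convex (lower semicontinuous) $f$ and scalars $p \ge 0$, $q > 0$, one has $q\, f(p/q) = \sup_{a} \left[ a\,p - f^{*}(a)\,q \right]$. The example checks this numerically for the KL generator $f(t) = t \log t$, whose conjugate is $f^{*}(z) = e^{z-1}$. The names `f_kl`, `f_kl_conj`, and `variational_term`, and the values of `p` and `q`, are hypothetical.

```python
import numpy as np

def f_kl(t):
    # Generator of the KL divergence: f(t) = t log t, for t > 0.
    return t * np.log(t)

def f_kl_conj(z):
    # Legendre-Fenchel conjugate of f(t) = t log t: f*(z) = exp(z - 1).
    return np.exp(z - 1.0)

def variational_term(a, p, q):
    # The expression inside the supremum: linear in p and q for fixed a.
    return a * p - f_kl_conj(a) * q

# Hypothetical probabilities for one (prediction, group) cell.
p, q = 0.3, 0.2

direct = q * f_kl(p / q)                      # q * f(p / q)
a_grid = np.linspace(-5.0, 5.0, 200001)       # brute-force search over the dual variable
grid_values = variational_term(a_grid, p, q)
sup_val = grid_values.max()
a_star = 1.0 + np.log(p / q)                  # closed-form maximizer for KL
at_star = variational_term(a_star, p, q)

print(direct, sup_val, at_star)
```

For a fixed dual variable `a`, the term is linear in `p` and `q`. A plausible reading of the TL;DR is that this linearity is what allows mini-batch estimates of the probabilities to yield unbiased gradient estimates, with the maximization over `a` handled by the ascent step of a descent-ascent method.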

Figures (5)

Figure 1: Performance of different $f$-divergences as regularizers. The experiment is on the Adult dataset with gender and race as sensitive attributes. While the offered tradeoffs are close to each other for small demographic parity violations, KL-divergence performs notably better in the low-fairness, high-accuracy regime. We do not display the performance for larger batch sizes or when only one sensitive attribute is available, because the differences between the $f$-divergences are insignificant in those settings.
Figure 2: Performance of the trained fair models on the Adult dataset with gender and race as two sensitive attributes, for different batch sizes. The red dashed line represents the naïve baseline where the model outputs zero with probability $p$. By increasing $p$, the model becomes fairer at the cost of a loss in accuracy.
Figure 3: Performance of different state-of-the-art approaches and our two methods for handling distribution shift. The dataset is Adult, and the sensitive attribute is gender. We randomly flip the gender entry of a proportion of the samples (from $0\%$ to $20\%$). As we observe, our approach shows greater robustness in DP violation than the other approaches.
Figure 4: Performance of the trained fair models on the new Adult dataset. The model is trained on one state (California or Texas) and evaluated on $50$ states. The distribution of each state's dataset differs from the others. Thus, the IID assumption does not hold across the datasets of different states.
Figure 5: Performance of the trained fair models on the COMPAS and German Credit datasets.
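The naïve baseline in the Figure 2 caption can be sketched in expectation. This assumes (our reading of the caption, not confirmed by it) that a binary prediction is overridden to zero with probability $p$, independently of the input, the label, and the group. Under that assumption each group's positive rate is scaled by $1 - p$, so the DP gap shrinks linearly in $p$. The function `naive_baseline` and all numbers below are hypothetical.

```python
def naive_baseline(rate_by_group, base_accuracy, frac_negative, p):
    """Expected DP gap and accuracy when predictions are zeroed with probability p.

    rate_by_group: positive-prediction rate of the original model per group.
    base_accuracy: accuracy of the original model.
    frac_negative: fraction of samples whose true label is zero.
    Assumption: the zeroing coin flip is independent of everything else.
    """
    scaled = [(1.0 - p) * r for r in rate_by_group]
    dp_gap = max(scaled) - min(scaled)
    accuracy = (1.0 - p) * base_accuracy + p * frac_negative
    return dp_gap, accuracy

# Hypothetical model: positive rates 0.3 and 0.1, 85% accuracy, 75% negative labels.
gap_0, acc_0 = naive_baseline([0.3, 0.1], 0.85, 0.75, p=0.0)
gap_half, acc_half = naive_baseline([0.3, 0.1], 0.85, 0.75, p=0.5)
gap_1, acc_1 = naive_baseline([0.3, 0.1], 0.85, 0.75, p=1.0)
print(gap_0, acc_0, gap_half, acc_half, gap_1, acc_1)
```

With these numbers the gap falls to zero as $p$ approaches one, while accuracy falls toward the accuracy of always predicting zero. That matches the tradeoff the caption describes, provided the original model is more accurate than the constant predictor.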

Theorems & Definitions (35)

Proposition 2.1
proof
Theorem 2.2
proof
Proposition 3.1
Proposition C.1
Proposition C.2
proof
Proposition C.3
proof
...and 25 more
