A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction

Zhouting Zhao; Tin Lok James Ng

Abstract

Striking an optimal balance between predictive performance and fairness continues to be a fundamental challenge in machine learning. In this work, we propose a post-processing framework that facilitates fairness-aware prediction by leveraging model ensembling. Designed to operate independently of any specific model internals, our approach is widely applicable across various learning tasks, model architectures, and fairness definitions. Through extensive experiments spanning classification, regression, and survival analysis, we demonstrate that the framework effectively enhances fairness while maintaining, or only minimally affecting, predictive accuracy.
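The abstract does not spell out the mechanics of the framework, so the following is only a minimal sketch of what output-level ensembling for fairness can look like, not the authors' method. It assumes a convex combination of a pre-trained performance model's scores with a second, group-agnostic predictor, and it uses the demographic parity gap as the fairness measure. The function names, the mixing weight, and the toy data are all hypothetical.

```python
import numpy as np


def mix_predictions(p_perf, p_fair, weight):
    """Convex combination of two models' predicted probabilities.

    weight = 0 returns the performance model's scores unchanged;
    weight = 1 returns the second model's scores. (Hypothetical helper.)
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    p_perf = np.asarray(p_perf, dtype=float)
    p_fair = np.asarray(p_fair, dtype=float)
    return (1.0 - weight) * p_perf + weight * p_fair


def demographic_parity_gap(scores, group):
    """Absolute difference in mean score between group 1 and group 0."""
    scores = np.asarray(scores, dtype=float)
    group = np.asarray(group)
    return abs(scores[group == 1].mean() - scores[group == 0].mean())


# Toy scores from an assumed pre-trained performance model.
p_perf = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
group = np.array([1, 1, 1, 0, 0, 0])

# Assumed second model: a constant predictor at the overall mean score,
# which by construction treats both groups identically.
p_fair = np.full_like(p_perf, p_perf.mean())

# The gap shrinks as more weight moves to the group-agnostic predictor.
gaps = {
    w: demographic_parity_gap(mix_predictions(p_perf, p_fair, w), group)
    for w in (0.0, 0.5, 1.0)
}
```

Only the model outputs are touched here, which is what makes this kind of post-processing independent of model internals. In practice the mixing weight would be chosen against a held-out accuracy measure rather than swept blindly.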

Paper Structure (22 sections, 18 equations, 10 figures, 2 tables)

Introduction
Background and Related Work
Post-processing for Fairness
Fairness Metrics
Mixture of Experts and Ensemble Learning
An Ensemble Approach to Fairness, One Pre-trained Model
General Case
Applications
An Ensemble Approach to Fairness, Two Pre-trained Models
Experimental Setting and Results
Datasets
Tasks and Evaluation Metrics
Model Configurations
Implementation and Baselines
Core Results
...and 7 more sections

Figures (10)

Figure 1: Performance and fairness trade-offs on the Adult dataset across varying performance models and sensitive-attribute settings. The figure consists of 20 panels arranged in a 5 $\times$ 4 grid. Rows represent different model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to experimental settings: (Col 1) RF as performance model, sensitive attribute: sex; (Col 2) MLP as performance model, sensitive attribute: sex; (Col 3) RF as performance model, sensitive attributes: sex + race; (Col 4) MLP as performance model, sensitive attributes: sex + race. Tested $\lambda$ values: (Col 1) [0.01, 0.5, 1, 5, 10, 100, 200, 300, 500], (Col 2) [0.01, 0.05, 1, 5, 10, 100, 500], (Col 3) [0.01, 0.05, 1, 5, 10, 100], (Col 4) [0.01, 0.05, 1, 5, 10, 100, 500].
Figure 2: Performance and fairness trade-offs on the COMPAS dataset across varying performance models and sensitive-attribute settings. The figure consists of 20 panels arranged in a 5 $\times$ 4 grid. Rows correspond to model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to experimental settings: (Col 1) RF as performance model, sensitive attribute: gender; (Col 2) MLP as performance model, sensitive attribute: gender; (Col 3) RF as performance model, sensitive attributes: gender + race; (Col 4) MLP as performance model, sensitive attributes: gender + race. Tested $\lambda$ values: (Cols 1--2) [1, 5, 10, 100, 200, 300, 500, 700, 1000]; (Cols 3--4) [0.01, 0.05, 0.1, 0.2, 0.5, 1, 5, 10, 30, 50].
Figure 3: Performance and fairness trade-offs on the Heart dataset across different model configurations and performance models. The figure contains 10 panels arranged in a 5 $\times$ 2 grid. Rows correspond to model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to performance models: (Col 1) RF with sensitive attribute gender; (Col 2) MLP with sensitive attribute gender. Tested $\lambda$ values: (RF) [0.01, 0.5, 1, 5, 10, 20, 50, 100, 200, 300, 500, 600]; (MLP) [0.01, 0.5, 1, 5, 10].
Figure 4: Performance and fairness trade-offs on the German dataset across different model configurations and performance models. The figure contains 10 panels arranged in a 5 $\times$ 2 grid. Rows correspond to model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to performance models: (Col 1) RF with sensitive attribute gender; (Col 2) MLP with sensitive attribute gender. Tested $\lambda$ values: [0.01, 0.1, 0.5, 1, 5, 10, 20].
Figure 5: Performance and fairness trade-offs on the Insurance dataset. The figure contains five panels. Shown from left to right: 1-pretrained Mixture, 1-pretrained MoE, 2-pretrained Mixture, 2-pretrained MoE, and the FRAPPÉ baseline. The performance model used is Random Forest with sensitive attribute gender. Tested $\lambda$ values: [0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.5, 1, 5, 10, 20, 50].
...and 5 more figures
