A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction

Zhouting Zhao, Tin Lok James Ng

Abstract

Striking an optimal balance between predictive performance and fairness continues to be a fundamental challenge in machine learning. In this work, we propose a post-processing framework that facilitates fairness-aware prediction by leveraging model ensembling. Designed to operate independently of any specific model internals, our approach is widely applicable across various learning tasks, model architectures, and fairness definitions. Through extensive experiments spanning classification, regression, and survival analysis, we demonstrate that the framework effectively enhances fairness while maintaining, or only minimally affecting, predictive accuracy.

Paper Structure (22 sections, 18 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Performance and fairness trade-offs on the Adult dataset across varying performance models and sensitive-attribute settings. The figure consists of 20 panels arranged in a 5 $\times$ 4 grid. Rows represent different model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to experimental settings: (Col 1) RF as performance model, sensitive attribute: sex; (Col 2) MLP as performance model, sensitive attribute: sex; (Col 3) RF as performance model, sensitive attributes: sex + race; (Col 4) MLP as performance model, sensitive attributes: sex + race. Tested $\lambda$ values: (Col 1) [0.01, 0.5, 1, 5, 10, 100, 200, 300, 500], (Col 2) [0.01, 0.05, 1, 5, 10, 100, 500], (Col 3) [0.01, 0.05, 1, 5, 10, 100], (Col 4) [0.01, 0.05, 1, 5, 10, 100, 500].
  • Figure 2: Performance and fairness trade-offs on the COMPAS dataset across varying performance models and sensitive-attribute settings. The figure consists of 20 panels arranged in a 5 $\times$ 4 grid. Rows correspond to model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to experimental settings: (Col 1) RF as performance model, sensitive attribute: gender; (Col 2) MLP as performance model, sensitive attribute: gender; (Col 3) RF as performance model, sensitive attributes: gender + race; (Col 4) MLP as performance model, sensitive attributes: gender + race. Tested $\lambda$ values: (Cols 1--2) [1, 5, 10, 100, 200, 300, 500, 700, 1000]; (Cols 3--4) [0.01, 0.05, 0.1, 0.2, 0.5, 1, 5, 10, 30, 50].
  • Figure 3: Performance and fairness trade-offs on the Heart dataset across different model configurations and performance models. The figure contains 10 panels arranged in a 5 $\times$ 2 grid. Rows correspond to model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to performance models: (Col 1) RF with sensitive attribute gender; (Col 2) MLP with sensitive attribute gender. Tested $\lambda$ values: (RF) [0.01, 0.5, 1, 5, 10, 20, 50, 100, 200, 300, 500, 600]; (MLP) [0.01, 0.5, 1, 5, 10].
  • Figure 4: Performance and fairness trade-offs on the German dataset across different model configurations and performance models. The figure contains 10 panels arranged in a 5 $\times$ 2 grid. Rows correspond to model configurations: (Row 1) 1-pretrained Mixture, (Row 2) 1-pretrained MoE, (Row 3) 2-pretrained Mixture, (Row 4) 2-pretrained MoE, and (Row 5) the FRAPPÉ baseline. Columns correspond to performance models: (Col 1) RF with sensitive attribute gender; (Col 2) MLP with sensitive attribute gender. Tested $\lambda$ values: [0.01, 0.1, 0.5, 1, 5, 10, 20].
  • Figure 5: Performance and fairness trade-offs on the Insurance dataset. The figure contains five panels. Shown from left to right: 1-pretrained Mixture, 1-pretrained MoE, 2-pretrained Mixture, 2-pretrained MoE, and the FRAPPÉ baseline. The performance model used is Random Forest with sensitive attribute gender. Tested $\lambda$ values: [0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.5, 1, 5, 10, 20, 50].
  • ...and 5 more figures