OxEnsemble: Fair Ensembles for Low-Data Classification

Jonathan Rystrøm, Zihao Fu, Chris Russell

TL;DR

OxEnsemble tackles fair classification in scarce-data regimes by training an ensemble of deep networks with per-member fairness constraints on held-out data and aggregating via majority voting. The authors prove theoretical guarantees for minimum-rate and error-parity fairness under restricted competence and provide data-size guidance for observing these guarantees in practice. Empirically, OxEnsemble delivers superior fairness–accuracy trade-offs across three medical-imaging datasets (HAM10000, Fitzpatrick17k, FairVLMed) with efficiency benefits from a shared backbone. This work offers a practically impactful, theoretically grounded path to equitable decision-making in high-stakes, data-scarce domains. Code is available at the provided repository link.

Abstract

We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences. We propose a novel approach, OxEnsemble, for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, OxEnsemble is both data-efficient, carefully reusing held-out data to enforce fairness reliably, and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.
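The aggregation step described above can be illustrated with a minimal sketch. It assumes binary 0/1 predictions and a strict-majority rule, and it scores fairness as the lowest per-group recall. The function names `majority_vote` and `min_group_recall`, and the toy labels and groups, are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict


def majority_vote(member_preds):
    """Combine per-member 0/1 predictions into one vote per sample.

    member_preds: list with one list of 0/1 predictions per ensemble member.
    A sample is predicted positive when strictly more than half vote 1.
    """
    n_members = len(member_preds)
    return [int(2 * sum(votes) > n_members) for votes in zip(*member_preds)]


def min_group_recall(y_true, y_pred, groups):
    """Return the lowest recall over all groups that contain a positive case."""
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for label, pred, group in zip(y_true, y_pred, groups):
        if label == 1:
            positives[group] += 1
            true_pos[group] += pred
    return min(true_pos[g] / positives[g] for g in positives)


# Toy data (made up for illustration): three members, six samples, two groups.
member_preds = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1],
]
y_true = [1, 1, 1, 0, 1, 0]
groups = ["a", "a", "b", "a", "b", "b"]

ensemble_pred = majority_vote(member_preds)
print(ensemble_pred)
print(min_group_recall(y_true, ensemble_pred, groups))
```

In this toy case each member misses one positive, but the misses fall on different samples, so the vote recovers all of them.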

Paper Structure

This paper contains 43 sections, 4 theorems, 26 equations, 7 figures, 6 tables.

Key Result

Lemma 1

Restricted competent ensembles do not degrade recall relative to the average recall of a member.
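A small numerical sketch can give intuition for this statement. It does not reproduce the paper's restricted-competence condition. Instead it makes a simpler assumption, chosen only for illustration: members miss positive cases independently of one another and all have the same recall. Under that assumption the recall of a strict-majority vote is a binomial tail sum. The helper name `majority_vote_recall` is hypothetical.

```python
from math import comb


def majority_vote_recall(member_recall: float, n_members: int) -> float:
    """Recall of a strict-majority vote over independent, identical members.

    Assumption (illustrative only): each member detects a positive case
    independently with probability `member_recall`.
    """
    if n_members % 2 == 0:
        raise ValueError("use an odd number of members to avoid ties")
    needed = n_members // 2 + 1  # votes required for a positive prediction
    return sum(
        comb(n_members, k)
        * member_recall**k
        * (1 - member_recall) ** (n_members - k)
        for k in range(needed, n_members + 1)
    )


for recall in (0.4, 0.5, 0.7):
    print(recall, round(majority_vote_recall(recall, 5), 5))
```

With five members, a member recall of 0.7 gives an ensemble recall above 0.7, while a member recall of 0.4 gives an ensemble recall below 0.4. This is consistent with Figure 2, which reports that competence violations are high when recall is below 0.5.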

Figures (7)

  • Figure 1: (a) Comparisons. (b) OxEnsemble pipeline. Train (1): Members share backbone and task + protected attributes. Validate (2): Enforce fairness constraint while maximising accuracy. Predict (3): Majority vote. Partitioning ensures full coverage; shared backbone improves efficiency, and voting provides guarantees.
  • Figure 2: Competence Violations vs Recall. Competence violations ($C_\rho$; 0=perfect) are high when recall<0.5 and stabilize at recall>0.5. Left: Test set for fitting and evaluation. Right: Validation set for fitting, test set for evaluation.
  • Figure 3: Fairness–accuracy AUC (FairAUC) relative to ERM. OxEnsemble achieves higher FairAUC than all baselines on Fitzpatrick17k (left) and HAM10000 (right). Error bars show 95% bootstrap CIs. Evaluation follows the paper's evaluation section over minimum-recall thresholds in $[0.5,1]$.
  • Figure 4: Pareto frontiers across datasets. OxEnsemble (green) yields better fairness–accuracy trade-offs than baselines (grey). Left/centre: min recall (HAM10000, Fitzpatrick17k). Right: equal opportunity (FairVLMed). See the paper's evaluation section for definitions.
  • Figure 5: Relationship between Ensemble Size (X-axis) and FairAUC (Y-axis) across two datasets. No significant relationship is observed.
  • ...and 2 more figures
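
The Pareto frontiers mentioned in the Figure 4 caption can be computed with a short sweep. The sketch below assumes each model is summarised by a (fairness, accuracy) pair where higher is better on both axes. The function name `pareto_frontier` and the points are made up for illustration; they are not results from the paper, and this is not the authors' evaluation code.

```python
def pareto_frontier(points):
    """Return the points that no other point beats on both axes.

    points: iterable of (fairness, accuracy) pairs, higher is better for both.
    The sweep visits points from highest to lowest fairness and keeps a point
    only if its accuracy exceeds every accuracy seen so far.
    """
    frontier = []
    best_accuracy = float("-inf")
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    for fairness, accuracy in ordered:
        if accuracy > best_accuracy:
            frontier.append((fairness, accuracy))
            best_accuracy = accuracy
    return frontier


# Toy (min recall, accuracy) pairs, invented for this example.
toy_points = [(0.9, 0.70), (0.8, 0.80), (0.7, 0.75), (0.6, 0.85), (0.6, 0.60)]
print(pareto_frontier(toy_points))
```

Here (0.7, 0.75) is dropped because (0.8, 0.80) is better on both axes, and (0.6, 0.60) is dropped because (0.6, 0.85) matches its fairness with higher accuracy.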

Theorems & Definitions (7)

  • Lemma 1
  • Proof
  • Theorem 1
  • Lemma 2
  • Proof
  • Lemma 3
  • Proof