Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs

Xuwei Tan, Ziyu Hu, Xueru Zhang

TL;DR

NH-Fair presents a tuning-aware, unified benchmark for fairness without harm that spans vision and vision-language models under standardized data, metrics, and training protocols. It introduces a DTO-based two-stage model selection procedure to establish a strong, fairness-conscious ERM baseline, and a Fairness-without-Harm (FWH) framework to assess mitigation methods across seven diverse datasets. The study reveals that many debiasing methods do not outperform a well-tuned ERM and that data augmentation can improve both fairness and utility, while LVLMs, despite higher average accuracy, continue to exhibit subgroup disparities and show limited gains from scaling. The work provides a reproducible, tuning-aware pipeline for harm-aware fairness evaluation and offers practical guidance for practitioners on model selection, training choices, and data-centric strategies. Overall, NH-Fair advances fairness evaluation by bridging classical vision models and LVLMs, clarifying where gains come from, and motivating more robust, bias-aware developments.

Abstract

Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. Although numerous bias mitigation methods have been proposed, comparing their effectiveness remains difficult because of heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines that help practitioners narrow the expensive hyperparameter search space while achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy; and (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.

Paper Structure (44 sections, 2 equations, 12 figures, 22 tables)

Figures (12)

  • Figure 1: Overview of NH-Fair, evaluating fairness across domains, tasks, models, and methods.
  • Figure 2: Two-stage model selection. (a) Each green dot represents a candidate ERM model’s performance on two sensitive groups. We select the ERM model (orange circle) whose performance is closest to the utopia point (red star) in Euclidean distance. (b) With the ERM baseline established, we classify models trained with bias-mitigation methods into four zones based on how their subgroup performance compares to that of the ERM. Starting from the Optimal Zone and moving counterclockwise, we check whether any model lies in the shaded region, which indicates improved fairness.
  • Figure 3: Comparative Analysis of Bias Mitigation Methods. OxonFair is excluded here because its FairFace results are missing.
  • Figure 4: Critical difference plots comparing methods across five metrics. Lower ranks indicate better performance. OxonFair is excluded since it does not support multi-class classification on FairFace.
  • Figure 5: LVLM performance using ACC and $100-\text{Gap}$. Other metrics are presented in Figure \ref{fig:radar_benchmark}.
  • ...and 7 more figures
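
The two-stage selection described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default choice of utopia point (the best per-group score observed among the candidates), the tie handling, and all zone labels other than "optimal" are assumptions made here for illustration. The shaded-region fairness check from the caption is not reproduced because its exact definition is not given in this excerpt.

```python
import math


def select_erm_by_dto(candidates, utopia=None):
    """Stage 1: pick the candidate closest to the utopia point.

    candidates: list of (group_a_score, group_b_score) tuples.
    utopia: optional reference point. Assumption: when omitted, it is the
    best score seen for each group across all candidates.
    Returns (index of the selected candidate, its Euclidean distance).
    """
    if utopia is None:
        n_groups = len(candidates[0])
        utopia = tuple(max(c[i] for c in candidates) for i in range(n_groups))
    distances = [math.dist(c, utopia) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances[best]


def classify_zone(model, erm):
    """Stage 2: place a mitigation model in one of four zones around the ERM.

    Assumption: a group counts as "not harmed" when its score is at least
    the ERM's score. Only the "optimal" label comes from the source; the
    other three labels are hypothetical.
    """
    a_ok = model[0] >= erm[0]
    b_ok = model[1] >= erm[1]
    if a_ok and b_ok:
        return "optimal"          # no group is worse off than under ERM
    if a_ok:
        return "harms_group_b"    # group A holds or gains, group B loses
    if b_ok:
        return "harms_group_a"    # group B holds or gains, group A loses
    return "harms_both"           # both groups are worse off than under ERM


if __name__ == "__main__":
    # Hypothetical per-group accuracies for three ERM candidates.
    erm_candidates = [(0.90, 0.70), (0.85, 0.80), (0.80, 0.82)]
    idx, dto = select_erm_by_dto(erm_candidates)
    erm = erm_candidates[idx]
    print("selected ERM:", erm, "distance:", round(dto, 4))
    print(classify_zone((0.86, 0.81), erm))
```

Using per-group scores rather than a single averaged accuracy keeps the "without harm" condition explicit: a model lands in the optimal zone only if neither group falls below the ERM baseline.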