Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition

Iris Dominguez-Catena, Daniel Paternain, Mikel Galar

TL;DR

The study addresses the need to quantify dataset-level demographic bias to mitigate downstream model unfairness. It develops a taxonomy of bias metrics spanning representational and stereotypical bias, and validates them in a FER case study across 20 datasets, which reveals redundancy among the metrics and guides the selection of a compact, interpretable metric set. The main contributions are (i) a unified framework linking ecology- and information-theory-inspired metrics to ML datasets, (ii) practical recommendations (Effective Number of Species, ENS, together with the Shannon Evenness Index, SEI, for representational bias; Cramér's V, $\phi_C$, and local $Z$ for stereotypical bias), and (iii) empirical insights on FER data sources: laboratory datasets show more representational bias, while in-the-wild (ITW) datasets show more stereotypical bias. The work supports better dataset curation and bias mitigation strategies, with implications for fairer and more accurate AI systems in vision tasks.
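
To make the recommended metrics concrete, the following is a minimal sketch of ENS, SEI, and $\phi_C$ using their textbook definitions: ENS as the exponential of the Shannon entropy, SEI as the entropy divided by the logarithm of the number of represented groups, and $\phi_C$ as Cramér's V computed from the chi-squared statistic of a contingency table. These definitions, the function names, and the toy counts are assumptions made for illustration; the paper's own implementation in the linked repository may differ in normalization or edge-case handling.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a list of group counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def effective_number_of_species(counts):
    """ENS: exp(H). Equals the number of groups when they are perfectly balanced."""
    return math.exp(shannon_entropy(counts))


def shannon_evenness(counts):
    """SEI: H / ln(K), with K the number of groups that actually appear.

    Assumption: a dataset with a single represented group gets evenness 0.0,
    since H / ln(1) is otherwise undefined.
    """
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return shannon_entropy(counts) / math.log(k)


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows.

    Rows could be demographic groups and columns target labels.
    Assumption: no row or column is entirely empty.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    dof = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * dof)) if dof > 0 else 0.0


if __name__ == "__main__":
    # Toy counts, not taken from any dataset in the paper.
    balanced = [50, 50, 50, 50]
    skewed = [97, 1, 1, 1]
    print("ENS balanced:", effective_number_of_species(balanced))
    print("ENS skewed:  ", effective_number_of_species(skewed))
    print("SEI skewed:  ", shannon_evenness(skewed))
    # Rows: two hypothetical demographic groups; columns: two hypothetical labels.
    print("Cramér's V:  ", cramers_v([[30, 10], [10, 30]]))
```

ENS and SEI are diversity measures, so higher values indicate less representational bias, whereas a higher $\phi_C$ indicates a stronger association between a demographic component and the target label, that is, more stereotypical bias.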

Abstract

Demographic biases in source datasets have been shown to be one of the causes of unfairness and discrimination in the predictions of Machine Learning models. One of the most prominent types of demographic bias is statistical imbalance in the representation of demographic groups in the datasets. In this paper, we study the measurement of these biases by reviewing the existing metrics, including those that can be borrowed from other disciplines. We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics. To illustrate the utility of our framework, and to further understand the practical characteristics of the metrics, we conduct a case study of 20 datasets used in Facial Emotion Recognition (FER), analyzing the biases present in them. Our experimental results show that many metrics are redundant and that a reduced subset of metrics may be sufficient to measure the amount of demographic bias. The paper provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models. The code is available at https://github.com/irisdominguez/dataset_bias_metrics.

Paper Structure (26 sections, 18 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Taxonomy of dataset demographic bias metrics.
  • Figure 2: Representational bias metrics for the three demographic components and the target label. The metrics are calculated as diversity metrics, with higher values corresponding to lower representational bias. The graphical representations of the values are normalized to the maximum value of the row. The datasets are sorted by the average of the normalized metrics.
  • Figure 3: Spearman's $\rho$ agreement between the representational bias metrics, measured independently for each component and then averaged for each pair of metrics. Higher $\rho$ values indicate high coherence between the rankings generated by the metrics.
  • Figure 4: Stereotypical bias metrics for the three demographic components against the target label. Higher values correspond to higher amounts of stereotypical bias. The graphical representation at each row, corresponding to a single metric and demographic component, is normalized to the maximum value of the row. In the $\phi_C$ row, a ${}^\circ$ mark indicates a statistically weak association and a ${}^\triangle$ mark a statistically medium association. The datasets are sorted by the average of the normalized metrics, from lower values (less stereotypical bias) on the left to higher values (more stereotypical bias) on the right.
  • Figure 5: Spearman's $\rho$ agreement between the stereotypical bias metrics, measured independently for each component and then averaged for each pair of metrics. Higher $\rho$ values indicate high coherence between the rankings generated by the metrics.
  • ...and 8 more figures
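
The agreement analysis described in the captions of Figures 3 and 5 (Spearman's $\rho$ measured per demographic component, then averaged for each pair of metrics) can be sketched roughly as follows. The function names, the component keys, and the toy scores are hypothetical placeholders, not values or code from the paper, and the tie handling (average ranks) is an assumption.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_agreement(metric_a, metric_b):
    """Average Spearman's rho over components.

    metric_a and metric_b map a component name to a list of scores,
    one score per dataset, with datasets in the same order in both.
    """
    rhos = [spearman_rho(metric_a[c], metric_b[c]) for c in metric_a]
    return sum(rhos) / len(rhos)


if __name__ == "__main__":
    # Toy scores for four hypothetical datasets and two hypothetical components.
    metric_a = {"component_1": [3.9, 2.1, 1.2, 3.0], "component_2": [1.8, 1.9, 1.1, 1.5]}
    metric_b = {"component_1": [0.98, 0.55, 0.20, 0.80], "component_2": [0.90, 0.85, 0.30, 0.70]}
    print("mean rho:", mean_agreement(metric_a, metric_b))
```

A mean $\rho$ close to 1 for a pair of metrics means they rank the datasets almost identically, which is the sense in which the study describes metrics as redundant.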