Benchmarking the Fairness of Image Upsampling Methods

Mike Laszkiewicz, Imant Daunhawer, Julia E. Vogt, Asja Fischer, Johannes Lederer

TL;DR

This work tackles fairness in conditional generative models for image upsampling by introducing a unified benchmarking framework that couples performance with fairness and diversity metrics. It defines rigorous fairness quantities (RDP, PR, CPR, and UCPR) and corresponding divergences to quantify violations, and demonstrates these on five upsampling methods using UnfairFace, a biased subset of FairFace. The empirical study reveals that none of the methods achieves statistical fairness, with results highly sensitive to training data bias; post-hoc improvements are possible (e.g., fair-pSp, training on FairFace) but do not reach full fairness, highlighting the need for careful data curation and evaluation. A reproducible pipeline and UnfairFace dataset are provided to standardize future assessments of conditional generative models in fairness-relevant downstream tasks.

Abstract

Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository.

Paper Structure (33 sections, 1 theorem, 20 equations, 21 figures, 5 tables)

Key Result

Corollary 3.1

A conditional generative model satisfies

Figures (21)

  • Figure 1: Upsampling results for models trained on UnfairFace and FairFace using test samples categorized as "Black".
  • Figure 2: Comparing the race reconstruction loss $L_{\operatorname{race}}^{\text{0-1}}(x_{\operatorname{HR}}, \hat{x}_{\operatorname{HR}})$ if $x_{\operatorname{HR}} \in C_j$ for varying $C_j$. Lower scores indicate a better reconstruction.
  • Figure 3: Upsampling results for models trained on UnfairFace and FairFace using uninformative test samples. The real image is an average over images classified as "White".
  • Figure 4: Comparing the uninformative conditional proportional representation distribution $\mathbb{P}_{\operatorname{UCPR}}$ of models trained on UnfairFace and FairFace. The horizontal dashed line indicates the bar height corresponding to a uniform distribution.
  • Figure 5: Given a low-resolution input ("White", "Black", or "Asian"), we assume that the class distributions of the reconstructions are given by the following probability mass functions.
  • ...and 16 more figures
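
The quantities named in the captions of Figures 2 and 4 can be illustrated with a small sketch. It assumes that a race classifier (hypothetical here, not specified in this summary) has already assigned a class label to each high-resolution image and to each reconstruction. The function names are illustrative, and the total variation distance to the uniform distribution is an assumed stand-in: this summary does not state which divergences the paper uses.

```python
from collections import Counter, defaultdict


def per_class_zero_one_loss(true_labels, recon_labels):
    """Per-class mean 0-1 loss: for each true class C_j, the fraction of
    reconstructions whose predicted class differs from C_j."""
    if len(true_labels) != len(recon_labels):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, r in zip(true_labels, recon_labels):
        totals[t] += 1
        errors[t] += int(t != r)
    return {c: errors[c] / totals[c] for c in totals}


def class_distribution(labels, classes):
    """Empirical probability mass function of `labels` over `classes`."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def tv_distance_to_uniform(dist):
    """Total variation distance between `dist` and the uniform distribution
    over the same classes (an assumed divergence, for illustration only)."""
    k = len(dist)
    return 0.5 * sum(abs(p - 1.0 / k) for p in dist.values())


# Toy labels, not results from the paper.
classes = ["White", "Black", "Asian"]
true = ["Black", "Black", "Black", "Black", "Asian", "Asian"]
recon = ["Black", "White", "Black", "White", "Asian", "Asian"]

losses = per_class_zero_one_loss(true, recon)
recon_dist = class_distribution(recon, classes)
gap = tv_distance_to_uniform(recon_dist)
```

A per-class loss of 0 means every reconstruction kept its class, and a distance of 0 means the reconstructed classes are uniformly distributed, which corresponds to the dashed reference line described for Figure 4.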

Theorems & Definitions (2)

  • Corollary 3.1
  • Example 9.1