Benchmarking the Fairness of Image Upsampling Methods

Mike Laszkiewicz; Imant Daunhawer; Julia E. Vogt; Asja Fischer; Johannes Lederer

TL;DR

This work addresses fairness in conditional generative models for image upsampling by introducing a unified benchmarking framework that evaluates performance alongside fairness and diversity metrics. It defines fairness quantities (RDP, PR, CPR, and UCPR) and corresponding divergences to quantify violations, and demonstrates them on five upsampling methods using UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. The empirical study finds that none of the methods achieves statistical fairness and that results are highly sensitive to training data bias; improvements are possible (e.g., using fair-pSp or training on FairFace) but do not reach full fairness, highlighting the need for careful data curation and evaluation. A reproducible pipeline and the UnfairFace dataset are provided to standardize future assessments of conditional generative models in fairness-relevant downstream tasks.
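To make the idea of "divergences that quantify violations" concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each reconstruction has already been assigned a class label by some classifier, and it measures how far the empirical class distribution lies from a uniform distribution. The function names, the placeholder class names, and the two divergences chosen here (a Pearson chi-squared divergence and a maximum absolute deviation) are illustrative assumptions; the paper's exact definitions are not reproduced in this summary.

```python
from collections import Counter


def class_distribution(labels, classes):
    """Empirical probability of each class among the given labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n for c in classes]


def chi2_from_uniform(p):
    """Pearson chi-squared divergence between p and the uniform distribution."""
    u = 1.0 / len(p)
    return sum((p_i - u) ** 2 / u for p_i in p)


def max_deviation_from_uniform(p):
    """Largest absolute gap between any class probability and 1/k."""
    u = 1.0 / len(p)
    return max(abs(p_i - u) for p_i in p)


# Hypothetical labels of 100 reconstructions over four placeholder classes.
classes = ["A", "B", "C", "D"]
labels = ["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10

p = class_distribution(labels, classes)
print("distribution:", p)
print("chi-squared divergence:", chi2_from_uniform(p))
print("max deviation:", max_deviation_from_uniform(p))
```

Both quantities are zero exactly when every class is equally represented, so larger values indicate a stronger departure from a uniform (diverse) output distribution under these assumptions.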

Abstract

Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository.

Paper Structure (33 sections, 1 theorem, 20 equations, 21 figures, 5 tables)

This paper contains 33 sections, 1 theorem, 20 equations, 21 figures, 5 tables.

Introduction
Notation
Related Work
Fairness for Supervised Models and Unconditional Generative Models
Fairness for Conditional Generative Models
Image Upsampling Methods
Benchmarking Fairness of Conditional Generative Models
Performance
Fairness and Diversity
Introducing UnfairFace
Experiments
Experimental setup
Qualitative Results
Upsampling Performance
Fairness and Diversity
...and 18 more sections

Key Result

Corollary 3.1

A conditional generative model satisfies

Figures (21)

Figure 1: Upsampling results for models trained on UnfairFace and FairFace using test samples categorized as "Black".
Figure 2: Comparing the race reconstruction loss $L_{\operatorname{race}}^{\text{0-1}}(x_{\operatorname{HR}}, \hat{x}_{\operatorname{HR}})$ if $x_{\operatorname{HR}} \in C_j$ for varying $C_j$. Lower scores indicate a better reconstruction.
Figure 3: Upsampling results for models trained on UnfairFace and FairFace using uninformative test samples. The real image is an average over images classified as "White".
Figure 4: Comparing the uninformative conditional proportional representation distribution $\mathbb{P}_{\operatorname{UCPR}}$ of models trained on UnfairFace and FairFace. The horizontal dashed line indicates the bar height corresponding to a uniform distribution.
Figure 5: Given a low-resolution input ("White", "Black", or "Asian"), we assume that the class distributions of the reconstructions are given by the following probability mass functions.
...and 16 more figures
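
The per-class comparison described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of a 0-1 race reconstruction loss, not the authors' code: it assumes that labels for the real high-resolution image and for its reconstruction are already available from some classifier (not specified here), and it averages the mismatch indicator separately for each true class $C_j$. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def zero_one_loss_by_class(true_labels, reconstructed_labels):
    """Mean 0-1 loss per true class: the fraction of samples whose
    reconstruction is assigned a different label than the original."""
    if len(true_labels) != len(reconstructed_labels):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, r in zip(true_labels, reconstructed_labels):
        totals[t] += 1
        errors[t] += int(t != r)
    return {c: errors[c] / totals[c] for c in totals}


# Toy labels (assumed classifier outputs, for illustration only).
true_labels = ["Black", "Black", "White", "White", "Asian"]
recon_labels = ["White", "Black", "White", "White", "White"]

losses = zero_one_loss_by_class(true_labels, recon_labels)
for cls, loss in losses.items():
    print(f"{cls}: {loss:.2f}")
```

A lower value for a class means its attribute is preserved more often after upsampling; large gaps between classes are the kind of disparity the benchmark is designed to expose.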

Theorems & Definitions (2)

Corollary 3.1
Example 9.1
