Benchmarking the Fairness of Image Upsampling Methods
Mike Laszkiewicz, Imant Daunhawer, Julia E. Vogt, Asja Fischer, Johannes Lederer
TL;DR
This work tackles fairness in conditional generative models for image upsampling by introducing a unified benchmarking framework that couples performance with fairness and diversity metrics. It defines rigorous fairness quantities (RDP, PR, CPR, and UCPR) and corresponding divergences to quantify violations, and demonstrates these on five upsampling methods using UnfairFace, a biased subset of FairFace. The empirical study reveals that none of the methods achieves statistical fairness, with results highly sensitive to training data bias; improvements are possible (e.g., using fair-pSp or training on FairFace) but do not reach full fairness, highlighting the need for careful data curation and evaluation. A reproducible pipeline and the UnfairFace dataset are provided to standardize future assessments of conditional generative models in fairness-relevant downstream tasks.
Abstract
Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository.
