Towards Fair and Robust Face Parsing for Generative AI: A Multi-Objective Approach
Sophia J. Abraham, Jonathan D. Hauenstein, Walter J. Scheirer
TL;DR
This work tackles bias and fragility in face parsing by proposing a homotopy-based multi-objective framework that jointly optimizes accuracy, fairness, and robustness. The loss combines a Dice-based accuracy term, a fairness term defined as the variance of per-group mIoU, and a robustness term against perturbations; the weights of these terms, $\alpha(t)$, $\beta(t)$, and $\gamma(t)$, are adjusted during training using Linear, Sigmoid, or Piecewise schedules. The authors validate the approach by integrating both single- and multi-objective parsers into GAN-based (Pix2PixHD) and diffusion-based (ControlNet) face synthesis pipelines, showing improvements in segmentation fairness, robustness, and downstream synthesis quality as measured by $\mathrm{FID}$ and $\mathrm{LPIPS}$. They provide a comprehensive evaluation on CelebAMask-HQ, including class-wise segmentation, perturbation tests, and cross-method comparisons, and present preliminary diffusion-based results to motivate broader exploration. The work demonstrates that fairness-aware segmentation can enhance photorealism and demographic consistency in generated faces, offering a pathway toward bias-aware generative AI; it also acknowledges computational and dataset limitations and proposes future directions for broader applicability.
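A minimal NumPy sketch of how the three loss terms could be combined at a single training step. It uses soft (probability-based) Dice and IoU so the same formulas could be ported to an autodiff framework. The function names, the soft-IoU surrogate for per-group mIoU, and the robustness term (taken here, hypothetically, as the Dice loss on predictions for perturbed inputs) are illustrative assumptions, not the authors' implementation. The arguments `alpha`, `beta`, and `gamma` stand for the values of the schedule weights at that step.

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """Accuracy term: 1 minus the mean soft Dice score over classes.

    probs, target: arrays of shape (N, C, H, W); target is one-hot.
    """
    axes = (0, 2, 3)  # sum over images and pixels, keep classes
    inter = np.sum(probs * target, axis=axes)
    denom = np.sum(probs, axis=axes) + np.sum(target, axis=axes)
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice.mean())

def soft_miou(probs, target, eps=1e-6):
    """Soft (probability-weighted) mean IoU over classes (assumed surrogate)."""
    axes = (0, 2, 3)
    inter = np.sum(probs * target, axis=axes)
    union = np.sum(probs + target, axis=axes) - inter
    return float(np.mean((inter + eps) / (union + eps)))

def fairness_term(probs, target, groups):
    """Fairness term: variance of mIoU across demographic groups.

    groups: integer array of shape (N,) with each image's group label.
    """
    scores = [soft_miou(probs[groups == g], target[groups == g])
              for g in np.unique(groups)]
    return float(np.var(scores))

def robustness_term(probs_perturbed, target):
    """Hypothetical robustness term: Dice loss on predictions for perturbed inputs."""
    return soft_dice_loss(probs_perturbed, target)

def multi_objective_loss(probs, probs_perturbed, target, groups, alpha, beta, gamma):
    """Weighted sum of accuracy, fairness, and robustness terms for one step."""
    return (alpha * soft_dice_loss(probs, target)
            + beta * fairness_term(probs, target, groups)
            + gamma * robustness_term(probs_perturbed, target))

# Toy data: 4 images, 3 classes, 8x8 pixels, two demographic groups.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(4, 8, 8))
target = np.eye(3)[labels].transpose(0, 3, 1, 2)  # one-hot, shape (N, C, H, W)
groups = np.array([0, 0, 1, 1])

# Perfect predictions for group 0, uninformative (uniform) ones for group 1.
degraded = target.copy()
degraded[groups == 1] = 1.0 / 3.0

print(fairness_term(target, target, groups))    # equal mIoU across groups
print(fairness_term(degraded, target, groups))  # positive: groups differ
```

Because the fairness term is a variance over groups, it is zero whenever every group reaches the same mIoU, however low that mIoU is. This is why it is paired with an accuracy term rather than optimized on its own.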
Abstract
Face parsing is a fundamental task in computer vision, enabling applications such as identity verification, facial editing, and controllable image synthesis. However, existing face parsing models often lack fairness and robustness, leading to biased segmentation across demographic groups and errors under occlusions, noise, and domain shifts. These limitations affect downstream face synthesis, where segmentation biases can degrade generative model outputs. We propose a multi-objective learning framework that optimizes accuracy, fairness, and robustness in face parsing. Our approach introduces a homotopy-based loss function that dynamically adjusts the importance of these objectives during training. To evaluate its impact, we compare multi-objective and single-objective U-Net models in a GAN-based face synthesis pipeline (Pix2PixHD). Our results show that fairness-aware and robust segmentation improves photorealism and consistency in face generation. Additionally, we conduct preliminary experiments using ControlNet, a structured conditioning model for diffusion-based synthesis, to explore how segmentation quality influences guided image generation. Our findings demonstrate that multi-objective face parsing improves demographic consistency and robustness, leading to higher-quality GAN-based synthesis.
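The homotopy-based loss can be pictured as continuously deforming the objective weights from an easy starting objective toward the full multi-objective target as training progresses. The sketch below is a minimal illustration of that idea for the Linear, Sigmoid, and Piecewise schedule families. The start weights (accuracy only), the equal end weights, the sigmoid steepness `k`, and the piecewise breakpoints `t0` and `t1` are all hypothetical choices; the paper's actual schedules may differ.

```python
import math

# Hypothetical endpoints for (alpha, beta, gamma) = (accuracy, fairness, robustness).
W_START = (1.0, 0.0, 0.0)       # accuracy only at the start of training
W_END = (1 / 3, 1 / 3, 1 / 3)   # equal weighting at the end of training

def linear_ramp(t):
    """r(t) = t."""
    return t

def sigmoid_ramp(t, k=10.0):
    """Logistic curve centred at t = 0.5, rescaled so r(0) = 0 and r(1) = 1."""
    def s(x):
        return 1.0 / (1.0 + math.exp(-k * (x - 0.5)))
    return (s(t) - s(0.0)) / (s(1.0) - s(0.0))

def piecewise_ramp(t, t0=0.3, t1=0.7):
    """0 before t0, linear between t0 and t1, 1 after t1."""
    return min(1.0, max(0.0, (t - t0) / (t1 - t0)))

def homotopy_weights(step, total_steps, ramp=linear_ramp):
    """Return (alpha, beta, gamma) at a given training step.

    The weights are a convex combination of W_START and W_END with
    coefficient r = ramp(t) in [0, 1], so they always sum to 1 and
    change continuously as the training progress t goes from 0 to 1.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    t = min(1.0, max(0.0, step / total_steps))
    r = ramp(t)
    return tuple((1.0 - r) * a + r * b for a, b in zip(W_START, W_END))

# Usage: weights at a few points of a 100-step run with the sigmoid ramp.
for step in (0, 50, 100):
    alpha, beta, gamma = homotopy_weights(step, 100, sigmoid_ramp)
    print(f"step {step:3d}: alpha={alpha:.3f} beta={beta:.3f} gamma={gamma:.3f}")
```

Keeping the weights on a convex combination fixes the overall loss scale; an unnormalized schedule would also change the effective step size as the weights grow, which would confound the comparison between schedules.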
