Towards Fair and Robust Face Parsing for Generative AI: A Multi-Objective Approach

Sophia J. Abraham, Jonathan D. Hauenstein, Walter J. Scheirer

TL;DR

This work tackles bias and fragility in facial parsing by proposing a homotopy-based multi-objective framework that jointly optimizes accuracy, fairness, and robustness. The method combines a Dice-based accuracy term, a fairness term capturing variance of per-group mIoU, and a robustness term against perturbations, scheduled dynamically via $\alpha(t)$, $\beta(t)$, and $\gamma(t)$ with options including Linear, Sigmoid, and Piecewise schedules. The authors validate the approach by integrating both single- and multi-objective parsers into GAN-based (Pix2PixHD) and diffusion-based (ControlNet) face synthesis pipelines, showing improvements in segmentation fairness, robustness, and downstream synthesis quality as measured by $\mathrm{FID}$ and $\mathrm{LPIPS}$. They provide a comprehensive evaluation on CelebAMask-HQ, including class-wise segmentation, perturbation tests, and cross-method comparisons, and present preliminary diffusion-based results to motivate broader exploration. The work demonstrates that fairness-aware segmentation can enhance photorealism and demographic consistency in generated faces, offering a pathway toward bias-aware generative AI while acknowledging computational and dataset limitations and proposing future directions for broader applicability.
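To make the three terms concrete, the sketch below shows one plausible reading of the objective: a soft Dice loss for accuracy, the variance of per-group mIoU for fairness, and a weighted sum with coefficients $\alpha$, $\beta$, $\gamma$. The weighted-sum form, the function names, and the array shapes are assumptions for illustration; this summary does not give the exact formulation, and the robustness term is left as a plain input because its definition is not stated here.

```python
import numpy as np


def dice_loss(probs, onehot, eps=1e-6):
    """Soft Dice loss averaged over classes.

    probs, onehot: arrays of shape (C, H, W); probs holds class
    probabilities, onehot the one-hot ground truth (assumed layout).
    """
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice.mean())


def fairness_loss(group_miou):
    """Variance of per-group mIoU: zero when all groups score the same."""
    return float(np.var(np.asarray(group_miou, dtype=float)))


def total_loss(l_acc, l_fair, l_rob, alpha, beta, gamma):
    """Assumed combination: a weighted sum of the three terms."""
    return alpha * l_acc + beta * l_fair + gamma * l_rob


# Toy usage: a perfect prediction and two demographic groups.
target = np.zeros((2, 4, 4))
target[0, :2] = 1.0
target[1, 2:] = 1.0
l_acc = dice_loss(target, target)
l_fair = fairness_loss([0.6, 0.8])
loss = total_loss(l_acc, l_fair, l_rob=0.2, alpha=0.5, beta=0.25, gamma=0.25)
```

Under this reading, the fairness term penalizes only the spread between groups, so it has to be paired with the accuracy term to keep all groups from converging on a uniformly low score.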

Abstract

Face parsing is a fundamental task in computer vision, enabling applications such as identity verification, facial editing, and controllable image synthesis. However, existing face parsing models often lack fairness and robustness, leading to biased segmentation across demographic groups and errors under occlusions, noise, and domain shifts. These limitations affect downstream face synthesis, where segmentation biases can degrade generative model outputs. We propose a multi-objective learning framework that optimizes accuracy, fairness, and robustness in face parsing. Our approach introduces a homotopy-based loss function that dynamically adjusts the importance of these objectives during training. To evaluate its impact, we compare multi-objective and single-objective U-Net models in a GAN-based face synthesis pipeline (Pix2PixHD). Our results show that fairness-aware and robust segmentation improves photorealism and consistency in face generation. Additionally, we conduct preliminary experiments using ControlNet, a structured conditioning model for diffusion-based synthesis, to explore how segmentation quality influences guided image generation. Our findings demonstrate that multi-objective face parsing improves demographic consistency and robustness, leading to higher-quality GAN-based synthesis.
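The dynamic re-weighting described above can be pictured as a schedule that moves the coefficients from an accuracy-only starting point toward a mixed objective over training. The following is a minimal sketch under stated assumptions: the start and end weights, the sigmoid steepness, and the piecewise breakpoints are hypothetical values chosen for illustration, not the paper's settings. The 30-epoch horizon matches the Figure 2 caption.

```python
import math


def progress(epoch, total_epochs, kind="linear",
             steepness=10.0, breakpoints=(0.33, 0.66)):
    """Map an epoch index to a homotopy parameter s in [0, 1]."""
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    if kind == "linear":
        return t
    if kind == "sigmoid":
        def raw(x):
            return 1.0 / (1.0 + math.exp(-steepness * (x - 0.5)))
        lo, hi = raw(0.0), raw(1.0)
        # Rescale so the curve starts at exactly 0 and ends at exactly 1.
        return (raw(t) - lo) / (hi - lo)
    if kind == "piecewise":
        first, second = breakpoints
        if t < first:
            return 0.0
        if t < second:
            return 0.5
        return 1.0
    raise ValueError(f"unknown schedule: {kind}")


def weights(epoch, total_epochs=30, kind="linear",
            start=(1.0, 0.0, 0.0), end=(0.5, 0.25, 0.25)):
    """Return (alpha, beta, gamma), interpolated between start and end."""
    s = progress(epoch, total_epochs, kind)
    return tuple((1.0 - s) * a + s * b for a, b in zip(start, end))


for kind in ("linear", "sigmoid", "piecewise"):
    alpha, beta, gamma = weights(15, kind=kind)
    print(f"{kind:9s} alpha={alpha:.3f} beta={beta:.3f} gamma={gamma:.3f}")
```

Because both endpoints sum to one, every intermediate triple does too, which keeps the overall loss scale comparable across epochs in this sketch.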

Paper Structure

This paper contains 32 sections, 2 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 2: Comparison of $\alpha$, $\beta$, and $\gamma$ schedules across three homotopy methods (Linear, Sigmoid, and Piecewise) over 30 epochs. Each subplot illustrates the evolution of a parameter ($\alpha$, $\beta$, or $\gamma$) as it adapts during training, highlighting the differences in transition dynamics across homotopy strategies. The legend below the figure identifies the homotopy method for each curve.
  • Figure 3: Qualitative comparison of Single-Objective and Multi-Objective models under perturbations. Blur ($\text{severity} = 0.3$), Gaussian Noise ($\text{severity} = 0.1$), and Occlusion ($\text{severity} = 0.5$) are applied to input images (first column). The Single-Objective model produces fragmented and inaccurate segmentations, especially in occluded and blurred regions. In contrast, the Multi-Objective model exhibits greater robustness, preserving facial structure despite degradations, with improved stability under occlusion.
  • Figure 4: Comparison of Fairness Loss Strategies on High-Disparity Demographics. The left plot represents the fairness variance-based approach, which minimizes the variance of per-group mIoU scores, indirectly reducing fairness gaps across demographic attributes. The right plot represents the per-group mIoU fairness loss, which explicitly tracks and optimizes fairness at a finer granularity. While the variance-based approach smooths out overall disparities, the per-group fairness loss provides better control over specific demographic attributes, ensuring higher consistency across subpopulations. Multi-objective models (Linear, Sigmoid, Piecewise) tend to provide more equitable segmentation across demographics compared to the Single-Objective baseline, though certain attributes still show variability in performance.
  • Figure 5: Performance comparison of mIoU across methods and perturbation types under varying severities. The plot illustrates the sensitivity of Single Objective and Multi-Objective methods (Linear, Sigmoid, and Piecewise) to perturbations, categorized by Gaussian noise, blur, occlusion, and salt-and-pepper noise. Each method is distinguished using different line styles and markers, while colors indicate the perturbation types.
  • Figure 6: Impact of Segmentation Maps on GAN-Based Face Synthesis. Segmentation maps from a Single-Objective U-Net and Multi-Objective U-Nets (Linear, Sigmoid, Piecewise) serve as inputs to a Pix2Pix GAN. Single-objective segmentation introduces inconsistencies, distorting facial details. In contrast, multi-objective segmentation improves structural coherence, yielding more natural and perceptually accurate face synthesis.
  • ...and 1 more figure
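The perturbation types named in the captions above (Gaussian noise, blur, occlusion, salt-and-pepper noise) can be approximated with a few NumPy helpers. This is a rough sketch, not the authors' evaluation code: how each `severity` value maps to a noise level, blur radius, or occluded area is an assumption made here, as are the function names and the `(H, W, C)` float image layout in `[0, 1]`.

```python
import numpy as np


def gaussian_noise(img, severity, rng):
    """Add zero-mean Gaussian noise with standard deviation `severity`."""
    return np.clip(img + rng.normal(0.0, severity, img.shape), 0.0, 1.0)


def salt_and_pepper(img, severity, rng):
    """Set roughly a `severity` fraction of pixels to pure black or white."""
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < severity / 2] = 0.0
    out[mask > 1 - severity / 2] = 1.0
    return out


def occlude(img, severity, rng):
    """Black out a random square with side severity * min(H, W)."""
    h, w = img.shape[:2]
    side = int(round(severity * min(h, w)))
    out = img.copy()
    if side == 0:
        return out
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    out[top:top + side, left:left + side] = 0.0
    return out


def box_blur(img, severity, max_radius=5):
    """Box blur whose radius grows with severity (a simple stand-in)."""
    r = int(round(severity * max_radius))
    if r == 0:
        return img.copy()
    h, w = img.shape[:2]
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    k = 2 * r + 1
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


rng = np.random.default_rng(0)
image = np.ones((8, 8, 3))
noisy = gaussian_noise(image, 0.1, rng)
blurred = box_blur(image, 0.3)
occluded = occlude(image, 0.5, rng)
speckled = salt_and_pepper(image, 0.2, rng)
```

A robustness sweep would apply each helper at several severities and record the parser's mIoU on the perturbed inputs, which is the kind of curve the Figure 5 caption describes.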