Constructing Fair Latent Space for Intersection of Fairness and Explainability

Hyungjun Joo, Hyeonggeun Han, Sehwan Kim, Sangwoo Hong, Jungwoo Lee

TL;DR

This work tackles the intersection of fairness and explainability by introducing a module that augments a pretrained generative model with a fair latent space. By disentangling label information $Z^Y$ from sensitive attributes $Z^S$ using information bottleneck principles and enforcing a diagonal, evenly scaled covariance via an invertible neural network (INN), the approach enables faithful counterfactual explanations and per-instance fairness. Empirical results on CelebA, CelebAHQ, and UTKFace show substantial improvements in EO, DP, and WGA over baselines, while maintaining efficiency by training only the INN module. The method yields actionable counterfactuals and explanations that help stakeholders assess and trust model decisions, with practical implications for deploying fair generative systems without full retraining of large models.

Abstract

As the use of machine learning models has increased, numerous studies have aimed to enhance fairness. However, research on the intersection of fairness and explainability remains insufficient, leading to potential issues in gaining the trust of actual users. Here, we propose a novel module that constructs a fair latent space, enabling faithful explanation while ensuring fairness. The fair latent space is constructed by disentangling and redistributing labels and sensitive attributes, allowing the generation of counterfactual explanations for each type of information. Our module is attached to a pretrained generative model, transforming its biased latent space into a fair latent space. Additionally, since only the module needs to be trained, there are advantages in terms of time and cost savings, without the need to train the entire generative model. We validate the fair latent space with various fairness metrics and demonstrate that our approach can effectively provide explanations for biased decisions and assurances of fairness.
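
The fairness metrics named in the TL;DR (EO, DP, WGA) are not defined in this summary. The sketch below assumes the common readings: equalized odds, demographic parity, and worst-group accuracy, each computed for binary labels and a binary sensitive attribute. The function names, the averaging used for the equalized-odds gap, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
from collections import defaultdict


def rate(values):
    """Mean of a non-empty list of 0/1 values."""
    return sum(values) / len(values)


def demographic_parity_gap(y_pred, s):
    """Gap in P(y_hat = 1 | s) between sensitive groups (assumed DP definition)."""
    by_group = defaultdict(list)
    for pred, group in zip(y_pred, s):
        by_group[group].append(pred)
    rates = [rate(v) for v in by_group.values()]
    return max(rates) - min(rates)


def equalized_odds_gap(y_true, y_pred, s):
    """Gap in P(y_hat = 1 | y, s) across groups, averaged over y in {0, 1}.

    Averaging over the two label values is one common convention; other
    papers take the maximum instead.
    """
    gaps = []
    for label in (0, 1):
        by_group = defaultdict(list)
        for true, pred, group in zip(y_true, y_pred, s):
            if true == label:
                by_group[group].append(pred)
        rates = [rate(v) for v in by_group.values()]
        gaps.append(max(rates) - min(rates))
    return sum(gaps) / len(gaps)


def worst_group_accuracy(y_true, y_pred, s):
    """Lowest accuracy over the (label, sensitive attribute) groups."""
    by_group = defaultdict(list)
    for true, pred, group in zip(y_true, y_pred, s):
        by_group[(true, group)].append(int(true == pred))
    return min(rate(v) for v in by_group.values())


# Toy data (hypothetical): 8 samples, binary label and sensitive attribute.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
s      = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

dp = demographic_parity_gap(y_pred, s)
eo = equalized_odds_gap(y_true, y_pred, s)
wga = worst_group_accuracy(y_true, y_pred, s)
print(dp, eo, wga)
```

Under these definitions, lower DP and EO gaps and a higher WGA indicate a fairer classifier.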

Paper Structure

This paper contains 31 sections, 4 theorems, 19 equations, 5 figures, 7 tables.

Key Result

Theorem 1

Let the representation $Z^Y$ follow a Gaussian distribution, and $\beta > 1$. The information bottleneck-based loss $L_{\mathrm{IB}} = I(Z^Y, E) - \beta I(Z^Y, Y)$ can be reformulated as:

Figures (5)

  • Figure 1: (A) Models aimed at enhancing fairness without any explanation. (B) The proposed model trains an invertible neural network based on a pre-trained generative model to construct a fair latent space where the information of labels and sensitive attributes is disentangled into separate dimensions. The Y-axis corresponds to the dimension of the sensitive attribute, while the X-axis corresponds to the dimension of the label. (C) Counterfactual explanations can be generated by adjusting values in the opposite direction within a fair latent space. Using an INN and a frozen generator, $x'$ and $x''$ are generated from $z'$ and $z''$.
  • Figure 2: Overview of our approach connecting theoretical analysis to practical implementation, comprising three main components. The distance loss $L_{di}$ regulates distances to respond specifically to attributes. Furthermore, the diagonalizing loss $L_{dg}$ and equalizing loss $L_{eq}$ transform the covariance matrix into an identical diagonal matrix.
  • Figure 3: Counterfactual explanations with samples initially misclassified as unattractive by the original model. The x-axis indicates changes in the latent space based on the direction of classifier $\hat{h}$. In the original model, (a) counterfactuals of attractiveness reveal a clear correlation with gender. After constructing a fair latent space by isolating $Z^S=\mathit{male}$, we can observe (b) counterfactuals of attractiveness that exhibit no gender bias, and (c) counterfactuals across genders with equal attractiveness.
  • Figure 4: (Left) Gender misclassification rates when representations obtained from the CelebAHQ test dataset are shifted along the unit vector of the $\mathit{attractive}$ classifier. (Right) Gender distribution after generating 1,000 images by shifting the mean of a standard Gaussian distribution along the unit vector of the $\mathit{attractive}$ classifier.
  • Figure 5: Counterfactual explanations for samples correctly classified by our model with the label $\mathit{attractive}$ and the sensitive attribute $\mathit{young}$.
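
The Figure 2 caption states that the diagonalizing loss $L_{dg}$ and the equalizing loss $L_{eq}$ push the latent covariance matrix toward a diagonal matrix with identical entries. The exact loss formulas are not given in this summary, so the sketch below uses assumed stand-ins: the sum of squared off-diagonal covariance entries for the diagonalizing term, and the squared spread of the diagonal entries around their mean for the equalizing term. All function names and the toy batches are hypothetical.

```python
def covariance(rows):
    """Sample covariance matrix of a batch of latent vectors (list of lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def diagonalizing_penalty(cov):
    """Assumed stand-in for L_dg: sum of squared off-diagonal entries."""
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)


def equalizing_penalty(cov):
    """Assumed stand-in for L_eq: squared deviation of variances from their mean."""
    diag = [cov[i][i] for i in range(len(cov))]
    mean = sum(diag) / len(diag)
    return sum((v - mean) ** 2 for v in diag)


# Toy batch whose two dimensions are uncorrelated and equally scaled.
balanced = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
# Toy batch whose second dimension is twice the first (correlated, unequal scale).
entangled = [[1.0, 2.0], [-1.0, -2.0], [2.0, 4.0], [-2.0, -4.0]]

for name, batch in (("balanced", balanced), ("entangled", entangled)):
    cov = covariance(batch)
    print(name, diagonalizing_penalty(cov), equalizing_penalty(cov))
```

Both penalties vanish for the balanced batch and are positive for the entangled one, which is the behavior the caption describes: minimizing them removes cross-dimension correlation and equalizes per-dimension variance.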

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Theorem
  • proof
  • Theorem
  • proof