Investigating the Effects of Fairness Interventions Using Pointwise Representational Similarity

Camila Kolling, Till Speicher, Vedant Nanda, Mariya Toneva, Krishna P. Gummadi

TL;DR

This work introduces Pointwise Normalized Kernel Alignment (PNKA), a per-data-point representational similarity measure that audits how debiasing interventions alter intermediate representations $Z$ across individuals. By comparing pointwise neighborhood structures via $PNKA(Z,Z',i)=\cos(K(Z)_i, K(Z')_i)$, the method reveals nuanced effects of group versus individual fairness on tabular data (COMPAS, Adult) and language embeddings (GP-/GN-GloVe, SEAT/WEAT contexts). Empirical findings show that group fairness typically affects a small subset of individuals, while individual fairness shifts representations broadly; PNKA’s predictions align with downstream outcomes, suggesting its utility for fairness audits. In language domains, PNKA uncovers that debiasing methods may not remove biases from stereotypical contexts and can instead alter gender-definitional information, underscoring limitations of traditional evaluation metrics. Overall, PNKA provides a general, datapoint-focused framework to audit and predict debiasing effects across tasks and modalities, informing more robust fairness assessments in ML systems.

Abstract

Machine learning (ML) algorithms can exhibit discriminatory behavior, negatively affecting certain populations across protected groups. To address this, numerous debiasing methods, and consequently evaluation measures, have been proposed. Current evaluation measures for debiasing methods suffer from two main limitations: (1) they primarily provide a global estimate of unfairness, failing to provide a more fine-grained analysis, and (2) they predominantly analyze the model output on a specific task, failing to generalize the findings to other tasks. In this work, we introduce Pointwise Normalized Kernel Alignment (PNKA), a pointwise representational similarity measure that addresses these limitations by measuring how debiasing methods affect the intermediate representations of individuals. On tabular data, the use of PNKA reveals previously unknown insights: while group fairness predominantly influences a small subset of the population, maintaining high representational similarity for the majority, individual fairness constraints uniformly impact representations across the entire population, altering nearly every data point. We show that by evaluating representations using PNKA, we can reliably predict the behavior of ML models trained on these representations. Moreover, applying PNKA to language embeddings shows that existing debiasing methods may not perform as intended, failing to remove biases from stereotypical words and sentences. Our findings suggest that current evaluation measures for debiasing methods are insufficient, highlighting the need for a deeper understanding of the effects of debiasing methods, and show how pointwise representational similarity metrics can help with fairness audits.
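
The pointwise score stated in the TL;DR, $PNKA(Z,Z',i)=\cos(K(Z)_i, K(Z')_i)$, can be illustrated with a minimal NumPy sketch. It assumes a linear kernel $K(Z) = ZZ^\top$ and column-wise mean-centering of each representation; both are assumptions made for illustration, and the function name `pnka` is hypothetical rather than taken from the authors' code.

```python
import numpy as np


def pnka(z_a, z_b, center=True):
    """Per-point similarity between two representations of the same n points.

    z_a: array of shape (n, d_a); z_b: array of shape (n, d_b).
    Returns an array of n scores in [-1, 1], one per data point.
    Assumes a linear kernel and (optionally) mean-centered features.
    """
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    if z_a.shape[0] != z_b.shape[0]:
        raise ValueError("both representations must cover the same points")

    if center:
        # Assumption: center each feature dimension before building the kernel.
        z_a = z_a - z_a.mean(axis=0, keepdims=True)
        z_b = z_b - z_b.mean(axis=0, keepdims=True)

    # Row i of each kernel matrix holds point i's similarity to every point.
    k_a = z_a @ z_a.T
    k_b = z_b @ z_b.T

    # Cosine similarity between matching rows of the two kernel matrices.
    numerator = np.sum(k_a * k_b, axis=1)
    denominator = np.linalg.norm(k_a, axis=1) * np.linalg.norm(k_b, axis=1)
    return numerator / np.maximum(denominator, 1e-12)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(size=(100, 16))
    # Stand-in for a debiased representation: baseline plus small noise.
    debiased = baseline + 0.1 * rng.normal(size=baseline.shape)
    scores = pnka(baseline, debiased)
    print("lowest-scoring points:", np.argsort(scores)[:10])
```

Sorting the scores in ascending order gives the points whose neighborhood structure changed most, which is the kind of "most affected individuals" selection the figure captions describe. Because the score compares kernel rows rather than raw coordinates, the two representations may have different widths.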

Paper Structure (33 sections, 8 equations, 18 figures, 4 tables)

Figures (18)

  • Figure 1: Distribution of PNKA similarity scores. The first row (blue plots) shows results for the COMPAS dataset, debiased with respect to race, while the second row (green plots) displays results for the Adult dataset, debiased based on gender. The vertical dotted line shows the overall similarity score provided by CKA (Kornblith et al., 2019). We compare the baseline representations (trained only for utility) with those from models trained using three different loss functions: Utility + Group Fairness (a, d), Utility + Individual Fairness (b, e), and Utility + Group and Individual Fairness (c, f).
  • Figure 2: Distribution of the binary attributes for the 10% most affected individuals (i.e., lowest PNKA score). The first row (blue plots) shows results for the COMPAS dataset, debiased with respect to race, while the second row (green plots) displays results for the Adult dataset, debiased based on gender. We compare the baseline representations (trained only for utility) with those from models trained using three different loss functions: Utility + Group Fairness (a, d), Utility + Individual Fairness (b, e), and Utility + Group and Individual Fairness (c, f). The horizontal red dotted line shows the population average per attribute.
  • Figure 3: Distribution of PNKA scores per group of words for the SemBias dataset (Zhao et al., 2018). We compare the baseline (GloVe) model and its debiased versions. Words with the lowest similarity scores are the ones that change the most from the baseline to its debiased version. Across all debiased embeddings, the words whose embeddings change the most are the gender-definition words.
  • Figure 4: Relationship between PNKA scores (x-axis) and percentage difference (y-axis) in magnitude of the projection on the gender direction $\overrightarrow{he} - \overrightarrow{she}$. A positive or negative percentage difference value indicates a shift in magnitude along the gender direction. Word embeddings that change their gender information are the ones that obtain low PNKA scores.
  • Figure 5: PNKA (with linear kernel) captures the overlap of $k$ nearest neighbors between two representations, i.e., the higher the PNKA score, the higher the fraction of overlapping neighbors. Results are an average over 3 runs, each one containing two models trained on the CIFAR-10 dataset (Krizhevsky, 2009) with the same architecture but different random initialization.
  • ...and 13 more figures
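
The neighbor-overlap quantity mentioned in the Figure 5 caption can be sketched as follows. This is an illustrative sketch only: the caption does not state the distance metric, so Euclidean distance is assumed here, and the helper names `knn_indices` and `knn_overlap` are hypothetical.

```python
import numpy as np


def knn_indices(z, k):
    """Indices of the k nearest neighbors of each row (Euclidean, self excluded)."""
    z = np.asarray(z, dtype=float)
    sq_norms = np.sum(z ** 2, axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]


def knn_overlap(z_a, z_b, k):
    """Per-point fraction of k nearest neighbors shared by two representations."""
    neighbors_a = knn_indices(z_a, k)
    neighbors_b = knn_indices(z_b, k)
    return np.array(
        [len(set(a) & set(b)) / k for a, b in zip(neighbors_a, neighbors_b)]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_a = rng.normal(size=(200, 32))
    z_b = z_a + 0.5 * rng.normal(size=z_a.shape)  # stand-in for a second model
    print("mean overlap:", knn_overlap(z_a, z_b, k=10).mean())
```

Plotting these per-point overlap fractions against per-point PNKA scores for the same pair of representations is one way to check the relationship the caption describes.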

Theorems & Definitions (2)

  • proof
  • proof