
TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes

Fanny Jourdan, Louis Béthune, Agustin Picard, Laurent Risser, Nicholas Asher

TL;DR

This work introduces Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from final latent representations, ensuring fairness even against non-linear classifiers.

Abstract

Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes like gender and ethnicity, leading to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying final latent representations to remove sensitive information without retraining the entire model. However, these methods typically rely on linear classifiers, which leaves models vulnerable to non-linear adversaries capable of recovering sensitive information. We introduce Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from final latent representations, ensuring fairness even against non-linear classifiers. Our experiments show that TaCo outperforms state-of-the-art methods, achieving greater reductions in the accuracy with which non-linear classifiers predict sensitive attributes while preserving overall task performance. Code is available at https://github.com/fanny-jourdan/TaCo.

Paper Structure (42 sections, 10 equations, 12 figures)

Figures (12)

  • Figure 1: Overview of the TaCo method. A decomposition of the final latent embedding matrix (I) yields concepts, whose importance (II) with respect to the sensitive variable and the label is evaluated with the Sobol method. Finally, some concepts are removed, which "neutralizes" the sensitive variable information and (III) produces a fairer classifier.
  • Figure 2: Co-importance plot for $r=20$ dimensions with respect to Occupation and Gender labels, computed with the Sobol method on the RoBERTa model. The color is based on the angle $a=90-\frac{2}{\pi}\arctan\left(\frac{y}{x}\right)$. The angles with extreme values correspond to concepts of high importance for gender but comparatively low importance for occupation. Removing these concepts yields a favorable accuracy/fairness trade-off.
  • Figure 3: Gender prediction accuracy versus the occupation accuracy drop, both measured with a two-layer MLP, after applying concept erasure methods to (top left) DistilBERT, (top right) RoBERTa, (bottom left) DeBERTa, and (bottom right) T5 representations. The occupation accuracy drop is reported relative to the Baseline model; a negative value means that occupation accuracy is higher with the method than it was initially. For INLP and RLACE, points represent different numbers of masked dimensions. For TaCo methods, points represent different numbers of removed concepts. LEACE and the Bios-neutral model are shown as single points, as they do not require parameter adjustments. The horizontal dashed line represents the accuracy of the Optimal Bayes classifier, indicating the theoretical lower bound for gender prediction accuracy. Results for a linear classifier are shown in Figure \ref{apx:fig:linear_results}.
  • Figure 4: Convergence curves (accuracy) during training for the gender classification task over 5 epochs. (Left) RoBERTa model trained on Bios; (right) RoBERTa model trained on Bios without explicit gender indicators.
  • Figure 5: Number of biographies for each occupation by gender in the full Bios dataset \cite{de2019bias}.
  • ...and 7 more figures
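
The decompose-then-remove step summarized in the Figure 1 caption can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the caption only says "a decomposition", so the truncated SVD, the function name `remove_concepts`, and its parameters `r` and `to_remove` are hypothetical choices, not the authors' implementation. The Sobol importance scoring (step II) is not reproduced; the indices of the concepts to drop are simply passed in by hand.

```python
import numpy as np


def remove_concepts(A, r, to_remove):
    """Rebuild an embedding matrix without some of its concepts.

    A         : (n_samples, d) matrix of final latent embeddings.
    r         : number of concepts kept by the (assumed) truncated SVD.
    to_remove : indices in [0, r) of the concepts to erase. In TaCo these
                would come from the importance analysis; here they are
                supplied by the caller (hypothetical interface).
    """
    # Decompose A = U diag(s) Vt; each row of Vt is one concept direction.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r]

    # Boolean mask selecting the concepts that are kept.
    keep = np.ones(r, dtype=bool)
    keep[np.asarray(list(to_remove), dtype=int)] = False

    # Reconstruct the embeddings from the remaining concepts only.
    return (U_r[:, keep] * s_r[keep]) @ Vt_r[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(100, 16))  # stand-in for real embeddings
    cleaned = remove_concepts(embeddings, r=16, to_remove=[0, 3])
    print(cleaned.shape)
```

Because the concept directions from an SVD are orthonormal, the cleaned embeddings have no component along the removed directions, and the matrix keeps its original shape so a downstream classifier can consume it unchanged. Which concepts to remove, and how many, is the part the paper addresses with its importance analysis and is left open in this sketch.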