Preserving Task-Relevant Information Under Linear Concept Removal

Floris Holstege, Shauli Ravfogel, Bram Wouters

TL;DR

Preserving Task-Relevant Information Under Linear Concept Removal introduces SPLINCE, an oblique projection that removes linear predictability of a protected attribute while exactly preserving covariance with a target label. The method enforces two geometric constraints: placing Cov($\boldsymbol{x},\boldsymbol{z}$) in the kernel and keeping Cov($\boldsymbol{x},\boldsymbol{y}$) in the range, yielding a unique minimum-distortion solution $P^{*}_\mathrm{SPLINCE} = W^{+} V (U^{\mathrm{T}} V)^{-1} U^{\mathrm{T}} W$, derived from whitening and covariance subspaces. The authors show equivalence results: after re-training a last linear layer without regularization, different range choices yield identical predictions when the kernel is fixed, while range can matter when regularization or freezing applies. Empirically, SPLINCE improves average and worst-group accuracy on NLP benchmarks like Bias in Bios and Multilingual Text Detox, while controlling stereotypes in language models, though vision tasks show more limited gains and sometimes greater distortion, indicating domain-dependent efficacy.
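As a rough illustration of the closed form above, the following NumPy sketch builds $W^{+} V (U^{\mathrm{T}} V)^{-1} U^{\mathrm{T}} W$ from empirical covariances. Several pieces are assumptions rather than details taken from this summary: the function names (`cross_cov`, `orth_complement`, `splince_projection`) are hypothetical, Cov($\boldsymbol{x},\boldsymbol{x}$) is assumed to have full rank so that $W$ is an ordinary inverse square root, $U$ is taken as an orthonormal basis of the orthogonal complement of the whitened Cov($\boldsymbol{x},\boldsymbol{z}$), and $V$ is taken as the whitened Cov($\boldsymbol{x},\boldsymbol{y}$) extended by the directions orthogonal to both whitened covariances. The paper's exact definitions of $U$ and $V$ may differ.

```python
import numpy as np

def cross_cov(A, B):
    """Empirical cross-covariance between the columns of A (n x p) and B (n x q)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return A.T @ B / A.shape[0]

def orth_complement(A, tol=1e-10):
    """Orthonormal basis of the orthogonal complement of the column space of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))
    return U[:, rank:]

def splince_projection(X, Z, Y):
    """Sketch of P = W^+ V (U^T V)^{-1} U^T W; assumes Cov(x, x) has full rank."""
    S_xx = cross_cov(X, X)
    S_xz = cross_cov(X, Z)
    S_xy = cross_cov(X, Y)

    # Whitening matrix W = Cov(x, x)^{-1/2} and its inverse, via eigendecomposition.
    evals, evecs = np.linalg.eigh(S_xx)
    W = (evecs / np.sqrt(evals)) @ evecs.T
    W_inv = (evecs * np.sqrt(evals)) @ evecs.T

    A = W @ S_xz  # whitened concept directions: placed in the kernel
    B = W @ S_xy  # whitened label directions: kept in the range

    U = orth_complement(A)  # basis of the complement of the kernel
    # Assumed range choice: label directions plus everything orthogonal to A and B.
    V = np.hstack([B, orth_complement(np.hstack([A, B]))])

    return W_inv @ V @ np.linalg.inv(U.T @ V) @ U.T @ W
```

Under these assumptions, centered embeddings would be transformed as `(X - X.mean(axis=0)) @ P.T`. The resulting matrix is idempotent, sends the empirical Cov($\boldsymbol{x},\boldsymbol{z}$) to zero, and leaves the empirical Cov($\boldsymbol{x},\boldsymbol{y}$) unchanged, up to floating-point error.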

Abstract

Modern neural networks often encode unwanted concepts alongside task-relevant information, leading to fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signals. We introduce SPLINCE (Simultaneous Projection for LINear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving their covariance with a target label. SPLINCE achieves this via an oblique projection that 'splices out' the unwanted direction yet protects important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLINCE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task information.
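The two properties claimed in the abstract can be read off from how covariances transform under a linear map. As a short sketch, assume $\mathbf{P}$ is a projection ($\mathbf{P}^2 = \mathbf{P}$) whose kernel contains the column space of $\boldsymbol{\Sigma}_{\boldsymbol{x},\boldsymbol{z}} = \operatorname{Cov}(\boldsymbol{x},\boldsymbol{z})$ and whose range contains the column space of $\boldsymbol{\Sigma}_{\boldsymbol{x},\boldsymbol{y}} = \operatorname{Cov}(\boldsymbol{x},\boldsymbol{y})$. The symbol $\mathbf{P}$ is introduced here only for illustration.

```latex
\begin{align*}
\operatorname{Cov}(\mathbf{P}\boldsymbol{x}, \boldsymbol{z})
  &= \mathbf{P}\,\boldsymbol{\Sigma}_{\boldsymbol{x},\boldsymbol{z}} = \mathbf{0}
  && \text{(columns of } \boldsymbol{\Sigma}_{\boldsymbol{x},\boldsymbol{z}} \text{ lie in } \ker \mathbf{P}\text{)}, \\
\operatorname{Cov}(\mathbf{P}\boldsymbol{x}, \boldsymbol{y})
  &= \mathbf{P}\,\boldsymbol{\Sigma}_{\boldsymbol{x},\boldsymbol{y}} = \boldsymbol{\Sigma}_{\boldsymbol{x},\boldsymbol{y}}
  && \text{(} \mathbf{P}\boldsymbol{v} = \boldsymbol{v} \text{ for every } \boldsymbol{v} \text{ in the range of } \mathbf{P}\text{)}.
\end{align*}
```

The first line is the zero-covariance condition that the summary associates with removing linear predictability of the protected attribute; the second is the exact preservation of covariance with the target label.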

Paper Structure

This paper contains 32 sections, 7 theorems, 57 equations, 15 figures, 7 tables.

Key Result

Theorem 1

Let $\boldsymbol{x}$, $\boldsymbol{z}$, and $\boldsymbol{y}$ be random vectors with finite second moments, non-zero covariances between $\boldsymbol{x}$ and $\boldsymbol{z}$ and between $\boldsymbol{x}$ and $\boldsymbol{y}$, and $\mathbb{E}[\boldsymbol{x}] = \mathbf{0}$. Let $\mathbf{W} = \ldots$ [...] subject to the two constraints, referred to as the kernel and range constraints, respectively, [...]

Figures (15)

  • Figure 1: Illustration of the steps of the projection suggested by Theorem 1 on two-dimensional data. The data $(a)$ is whitened $(b)$. Then, we use $\mathbf{V} (\mathbf{U}^{\mathrm{T}} \mathbf{V})^{-1} \mathbf{U}^{\mathrm{T}}$ to project parallel to $\mathbf{W}\boldsymbol{\Sigma}_{\boldsymbol{x}, \boldsymbol{z}}$ onto $\mathbf{W}\boldsymbol{\Sigma}_{\boldsymbol{x}, \boldsymbol{y}}$, and subsequently unwhiten $(c)$. With LEACE, $\boldsymbol{\Sigma}_{\boldsymbol{x}, \boldsymbol{y}}$ is altered $(d)$.
  • Figure 2: Performance of different projections on the Bias in Bios and Multilingual Text Detoxification datasets. We re-train the last layer after applying each projection. Points are averages over 3 seeds for Bias in Bios and 5 seeds for Multilingual Text Detoxification. The error bars reflect the 95% confidence interval.
  • Figure 3: Results of applying different projections to the last layer of various Llama models for the Winobias dataset. The left plot shows the accuracy on a test set consisting of half pro-stereotypical and half anti-stereotypical prompts. The right plot shows the accuracy on the anti-stereotypical prompts in this test set.
  • Figure 4: Application of different projections to raw pixel data of CelebA. The first column shows the original image. The next four columns show the image after the respective projection. The final three columns show the difference between the original image and the image after the projection.
  • Figure 5: The difference between the SPLINCE and LEACE projections on the Bias in Bios dataset for different levels of $l_2$ regularization. We show the (worst-group) accuracy of SPLINCE minus the (worst-group) accuracy of LEACE. We re-train the last layer after applying each projection. Points are based on the average over 3 seeds. The error bars reflect the 95% confidence interval.
  • ...and 10 more figures

Theorems & Definitions (13)

  • Theorem 1
  • Theorem 2: Equivalent predictions after oblique re-training
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • proof: Proof of Theorem 2
  • Theorem 3: Excess-risk of leace and splince in a regression setting
  • ...and 3 more