Preserving Task-Relevant Information Under Linear Concept Removal
Floris Holstege, Shauli Ravfogel, Bram Wouters
TL;DR
The paper introduces SPLINCE, an oblique projection that removes linear predictability of a protected attribute $\boldsymbol{z}$ while exactly preserving the covariance of the representation $\boldsymbol{x}$ with a target label $\boldsymbol{y}$. The method enforces two geometric constraints: $\mathrm{Cov}(\boldsymbol{x},\boldsymbol{z})$ must lie in the kernel of the projection, and $\mathrm{Cov}(\boldsymbol{x},\boldsymbol{y})$ must lie in its range. Together these yield a unique minimum-distortion solution $P^{*}_\mathrm{SPLINCE} = W^{+} V (U^{\mathrm{T}} V)^{-1} U^{\mathrm{T}} W$, where $W$ is a whitening matrix and $U$, $V$ are derived from the covariance subspaces. The authors also prove equivalence results: when the last linear layer is re-trained without regularization, projections with the same kernel yield identical predictions regardless of their range, whereas the choice of range can matter when regularization is applied or the last layer is frozen. Empirically, SPLINCE improves average and worst-group accuracy on NLP benchmarks such as Bias in Bios and Multilingual Text Detox, and it controls stereotypes in language models. On vision tasks the gains are more limited and the distortion is sometimes greater, which indicates that efficacy is domain-dependent.
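The closed form above can be illustrated with a small NumPy sketch. It is not the authors' reference implementation: the function names, the synthetic data, and the way $U$ and $V$ are built are assumptions chosen only to satisfy the two constraints stated in the summary. Here $U$ spans the orthogonal complement of the whitened $\mathrm{Cov}(\boldsymbol{x},\boldsymbol{z})$, and $V$ spans the whitened $\mathrm{Cov}(\boldsymbol{x},\boldsymbol{y})$ plus everything orthogonal to both covariances. The sketch further assumes a full-rank feature covariance and linearly independent concept and label covariances.

```python
import numpy as np

def orth_basis(A, tol=1e-10):
    """Orthonormal basis for the column space of A (via SVD)."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, : int((s > tol).sum())]

def orth_complement(A, tol=1e-10):
    """Orthonormal basis for the orthogonal complement of col(A)."""
    U, s, _ = np.linalg.svd(A, full_matrices=True)
    return U[:, int((s > tol).sum()):]

def splince_projection(X, Z, Y):
    """Hypothetical helper. X: (n, d) features, Z: (n, k) concept, Y: (n, m) label.
    Returns a d x d oblique projection P = W^+ V (U^T V)^{-1} U^T W."""
    n = X.shape[0]
    Xc, Zc, Yc = (M - M.mean(axis=0) for M in (X, Z, Y))
    S_xx = Xc.T @ Xc / (n - 1)
    S_xz = Xc.T @ Zc / (n - 1)
    S_xy = Xc.T @ Yc / (n - 1)

    # Whitening matrix W = (S_xx^{1/2})^+ from the eigendecomposition of S_xx.
    evals, evecs = np.linalg.eigh(S_xx)
    keep = evals > 1e-10
    W = (evecs[:, keep] / np.sqrt(evals[keep])) @ evecs[:, keep].T
    W_pinv = np.linalg.pinv(W)

    A = W @ S_xz  # whitened concept covariance: must end up in the kernel
    B = W @ S_xy  # whitened label covariance: must stay in the range

    # Assumed construction: kernel = col(A), range = col(B) + (col(A) + col(B))^perp.
    U = orth_complement(A)
    V = np.hstack([orth_basis(B), orth_complement(np.hstack([A, B]))])
    return W_pinv @ V @ np.linalg.inv(U.T @ V) @ U.T @ W

# Synthetic data (illustrative only): binary concept z, correlated binary label y.
rng = np.random.default_rng(0)
n, d = 500, 6
z = rng.integers(0, 2, size=(n, 1)).astype(float)
y = ((z[:, 0] + rng.normal(size=n)) > 0.5).astype(float)[:, None]
X = (rng.normal(size=(n, d))
     + 2.0 * z @ rng.normal(size=(1, d))
     + 1.5 * y @ rng.normal(size=(1, d)))

P = splince_projection(X, z, y)

# Covariances of the projected features with the concept and the label.
Xc = X - X.mean(axis=0)
Xp = Xc @ P.T
cov_z_after = Xp.T @ (z - z.mean(axis=0)) / (n - 1)
cov_y_before = Xc.T @ (y - y.mean(axis=0)) / (n - 1)
cov_y_after = Xp.T @ (y - y.mean(axis=0)) / (n - 1)
print("max |Cov(Px, z)|:", np.abs(cov_z_after).max())
print("max |Cov(Px, y) - Cov(x, y)|:", np.abs(cov_y_after - cov_y_before).max())
```

In this sketch both printed quantities should be zero up to floating-point error, since $U^{\mathrm{T}} W\,\mathrm{Cov}(\boldsymbol{x},\boldsymbol{z}) = 0$ by construction and $W\,\mathrm{Cov}(\boldsymbol{x},\boldsymbol{y})$ lies in the span of $V$. Whether this particular choice of the remaining range directions matches the minimum-distortion solution should be checked against the paper's theorem.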
Abstract
Modern neural networks often encode unwanted concepts alongside task-relevant information, leading to fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signals. We introduce SPLINCE (Simultaneous Projection for LINear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving their covariance with a target label. SPLINCE achieves this via an oblique projection that 'splices out' the unwanted direction yet protects important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLINCE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task information.
