What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?
Richard Johansson
TL;DR
This paper analyzes projection-based concept removal methods and shows that the transformation does not simply erase the targeted information: it induces strong dependencies between the rows of the dataset, which violates the i.i.d. assumption. A theoretical analysis (focused on Mean Projection) and experiments on synthetic and real data (BoW and BERT representations) show that, in the transformed dataset, instances tend to lie near neighbors with the opposite label, and that anti-clustering can in some cases recover the original groupings. Cross-validated accuracies for predicting the removed concept fall below chance, and the distributions of predicted probabilities shift away from i.i.d. baselines, which indicates a substantial change in the geometry of the data. The findings warn that projection-based scrubbing may lead practitioners to wrongly assume that the transformed instances are independent, with implications for privacy, causal inference, and the release of processed datasets.
Abstract
We investigate the behavior of methods that use linear projections to remove information about a concept from a language representation, and we consider the question of what happens to a dataset transformed by such a method. A theoretical analysis and experiments on real-world and synthetic data show that these methods inject strong statistical dependencies into the transformed datasets. After applying such a method, the representation space is highly structured: in the transformed space, an instance tends to be located near instances of the opposite label. As a consequence, the original labeling can in some cases be reconstructed by applying an anti-clustering method.
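The injected dependencies can be illustrated with a small synthetic sketch. It is not the paper's experimental code. It assumes that Mean Projection means removing the single direction between the two class means, fitted on the full dataset, and it uses a hand-written nearest-centroid classifier with stratified folds as a stand-in for whatever classifier is used to predict the concept. The function names, the data dimensions, and the mean shift of 2.0 are all hypothetical choices made for illustration.

```python
import numpy as np

def mean_projection(X, y):
    """Remove the direction between the two class means (fitted on all rows).
    Assumption: this is what 'Mean Projection' refers to."""
    v = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    v /= np.linalg.norm(v)
    # Project every row onto the orthogonal complement of v.
    return X - np.outer(X @ v, v)

def nearest_centroid_cv_accuracy(X, y, k=5, seed=0):
    """Stratified k-fold accuracy of a nearest-centroid classifier."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = rng.permutation(np.flatnonzero(y == label))
        for f, chunk in enumerate(np.array_split(idx, k)):
            folds[f].extend(chunk)
    correct = 0
    for f in range(k):
        test = np.array(folds[f])
        train = np.setdiff1d(np.arange(len(y)), test)
        # Centroids are estimated on the training fold only.
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        d0 = np.linalg.norm(X[test] - c0, axis=1)
        d1 = np.linalg.norm(X[test] - c1, axis=1)
        pred = (d1 < d0).astype(int)
        correct += int((pred == y[test]).sum())
    return correct / len(y)

# Hypothetical synthetic data: Gaussian noise plus a label-dependent shift.
rng = np.random.default_rng(42)
n, d = 200, 50
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * y

acc_before = nearest_centroid_cv_accuracy(X, y)
X_proj = mean_projection(X, y)
acc_after = nearest_centroid_cv_accuracy(X_proj, y)

print(f"CV accuracy before projection: {acc_before:.2f}")
print(f"CV accuracy after projection:  {acc_after:.2f}")
```

Because the projection is fitted on all rows, the two class means coincide exactly in the transformed data. In any training fold the class means must therefore differ in the direction opposite to the held-out fold, so a classifier fitted on the training fold tends to predict the wrong label on held-out rows. Under these assumptions the second accuracy should fall below 0.5 rather than settle at chance.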
