What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?

Richard Johansson

TL;DR

This paper analyzes projection-based concept removal methods, showing that transforming representations does not erase the targeted information but instead induces strong row-wise dependencies that violate i.i.d. assumptions. Through theoretical analysis (focused on Mean Projection) and extensive experiments on synthetic and real data (BoW and BERT representations), it demonstrates that the transformed dataset places instances near opposite-label neighbors, and anti-clustering can often recover the original groupings. Cross-validated accuracies for predicting the removed concept fall below chance, and predicted-probability distributions shift away from i.i.d. baselines, highlighting significant structural changes in the data geometry. The findings warn that projection-based scrubbing may mislead practitioners about dataset independence and have implications for privacy, causal inference, and data-distribution practices when releasing processed datasets.
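The below-chance effect can be reproduced on synthetic data. The following is a minimal sketch, not the paper's implementation. It assumes that Mean Projection removes the direction of the difference between the two class means, and it uses a nearest-centroid classifier as an illustrative choice; the function names and the random data are hypothetical. After the projection both class means coincide, so holding out an instance pushes the centroid of its own class away from it, and the opposite-class centroid is always the closer one.

```python
import numpy as np

def mean_projection(X, y):
    """Project out the direction between the two class means (assumed form of MP)."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    u = d / np.linalg.norm(d)
    return X - np.outer(X @ u, u)

def loo_nearest_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (illustrative choice)."""
    m = len(y)
    correct = 0
    for i in range(m):
        keep = np.arange(m) != i              # hold out instance i
        X_tr, y_tr = X[keep], y[keep]
        c0 = X_tr[y_tr == 0].mean(axis=0)     # centroid of class 0 without x_i
        c1 = X_tr[y_tr == 1].mean(axis=0)     # centroid of class 1 without x_i
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        correct += int(pred == y[i])
    return correct / m

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                # synthetic features, no real signal
y = np.array([0, 1] * 100)                    # balanced, arbitrary labels

X_mp = mean_projection(X, y)                  # projection fitted on the full dataset
acc_mp = loo_nearest_centroid_accuracy(X_mp, y)
print(f"LOO accuracy after projection: {acc_mp:.2f}")
```

Under these assumptions every held-out instance is assigned to the wrong class, so the leave-one-out accuracy is 0 rather than the 0.5 expected for i.i.d. data with uninformative features. The key design point is that the projection is computed on the whole dataset before the cross-validation split, which is the scenario of a dataset that is transformed and then released.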

Abstract

We investigate the behavior of methods that use linear projections to remove information about a concept from a language representation, and we consider the question of what happens to a dataset transformed by such a method. A theoretical analysis and experiments on real-world and synthetic data show that these methods inject strong statistical dependencies into the transformed datasets. After applying such a method, the representation space is highly structured: in the transformed space, an instance tends to be located near instances of the opposite label. As a consequence, the original labeling can in some cases be reconstructed by applying an anti-clustering method.
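One way to quantify the claim that instances end up near instances of the opposite label is to count how often an instance's nearest neighbor carries the other label. The sketch below is an assumption about how such a statistic could be computed (Euclidean distance, a single nearest neighbor, a hypothetical function name); it is not taken from the paper's code.

```python
import numpy as np

def opposite_label_nn_rate(X, y):
    """Fraction of instances whose nearest neighbor (Euclidean) has a different label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)              # an instance is not its own neighbor
    nn = np.argmin(d2, axis=1)                # index of each instance's nearest neighbor
    return float(np.mean(y[nn] != y))

# Toy check: four points on a line forming two tight pairs.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
rate_alternating = opposite_label_nn_rate(X_toy, [0, 1, 0, 1])  # each pair is mixed
rate_grouped = opposite_label_nn_rate(X_toy, [0, 0, 1, 1])      # each pair is pure
print(rate_alternating, rate_grouped)
```

Applied to a representation before and after a projection-based transformation, a rate near 0.5 would be consistent with labels that are unrelated to the geometry, while a rate clearly above 0.5 would indicate the opposite-label neighborhood structure described above. How large the shift is will depend on the method, the dimensionality, and the dataset size.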

Paper Structure (15 sections, 1 theorem, 4 figures)

Introduction
Concept Removal Methods
Theoretical Analysis
Experiments
Datasets
Prediction Accuracy
Predicted Probabilities
Neighborhood Structure
Recovering the Original Grouping
Related Work
Implications and Conclusion
Limitations
Ethical Discussion
Acknowledgements
Bibliographical References

Key Result

Theorem 1

Let $X \in \mathbb{R}^{m,n}$ be a feature matrix and $Y \in \{0,1\}^m$ the class labels. Mean Projection (MP) is applied to $X$ with respect to $Y$, and we refer to the result as $X_{\mathrm{MP}}$. We carry out a leave-one-out cross-validation in the transformed dataset where we set a single instance $x_i, y_i$ a

Figures (4)

Figure 1: Cross-validated accuracy scores for predicting the removed concept over INLP iterations. Each curve corresponds to a size $n$ of the dataset.
Figure 2: Distribution of predicted probabilities for the positive (orange) and negative classes (blue).
Figure 3: Proportion of instances whose nearest neighbor is of the opposite label, for different $n$.
Figure 4: Cluster purity scores comparing the original labeling to the anti-clustering result.

Theorems & Definitions (2)

Theorem
proof
