Log-linear Guardedness and its Implications

Shauli Ravfogel; Yoav Goldberg; Ryan Cotterell

TL;DR

The paper formalizes concept erasure through the lens of log-linear guardedness, using $\mathcal{V}$-information to quantify how much protected-attribute information remains after a guarding function $h$ is applied to the representations. It shows that, for a binary downstream task and a discretized log-linear model family $\mathcal{V}^{\delta}$, the leaked information is tightly bounded ($I_{\mathcal{V}^{\delta}}(\widehat{\mathrm{Y}} \to \mathrm{Z}) < \varepsilon$). For multiclass downstream tasks with a softmax, however, guarded representations can still leak substantial information about the protected attribute via appropriately structured $K$-Voronoi distributions, which challenges the completeness of linear erasure as a bias mitigation technique. Experiments that use RLACE to guard BERT representations on the Bias in Bios dataset corroborate the theory: binary leakage is reduced but not eliminated, while multiclass settings can reveal the protected attribute. These results call for caution when applying linear erasure methods and motivate further study of intrinsic versus extrinsic bias. Overall, the work clarifies the limitations of log-linear guardedness and motivates new approaches that account for the specifics of downstream classifiers and task structure.
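For readers unfamiliar with the notation, the quantities named above can be written out as follows. This is a sketch based on the standard definition of $\mathcal{V}$-information as a gap between predictive entropies, not a quotation of the paper; the symbol $f[x](z)$ (the probability a predictor $f \in \mathcal{V}$ assigns to $z$ given side information $x$) is an assumption of this sketch, and the paper's Definition 2.1 may differ in detail.

```latex
\begin{align}
  % Conditional V-entropy: the best expected log-loss achievable inside the family V
  H_{\mathcal{V}}(\mathrm{Z} \mid \mathrm{X})
    &= \inf_{f \in \mathcal{V}} \mathbb{E}_{x, z}\bigl[ -\log f[x](z) \bigr] \\
  % V-information: how much access to X lowers that loss relative to no side information
  I_{\mathcal{V}}(\mathrm{X} \to \mathrm{Z})
    &= H_{\mathcal{V}}(\mathrm{Z} \mid \varnothing) - H_{\mathcal{V}}(\mathrm{Z} \mid \mathrm{X}) \\
  % Guardedness: h guards X with respect to Z over V when little usable information remains
  I_{\mathcal{V}}\bigl(h(\mathrm{X}) \to \mathrm{Z}\bigr) &< \varepsilon
\end{align}
```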

Abstract

Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model \emph{can} be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.
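The multiclass claim can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the paper's construction or data: the four 2-D points, the attribute labels, the four-class weight matrix `W`, and the helper names are all hypothetical, and it uses plain (non-discretized) log-linear models with the predicted class fed to the adversary as a one-hot vector. The points are placed so that no binary log-linear model predicts the attribute better than its prior, yet the argmax of a four-class linear classifier determines the attribute exactly.

```python
import math

# Toy data, assumed for illustration only (not taken from the paper):
# four equally likely 2-D representations and a balanced binary attribute.
X = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
Z = [1, 1, 0, 0]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def binary_xent_bits(logits, labels):
    """Mean cross-entropy in bits of the model p(z=1) = sigmoid(logit)."""
    total = 0.0
    for t, z in zip(logits, labels):
        p = sigmoid(t)
        total -= math.log2(p if z == 1 else 1.0 - p)
    return total / len(labels)

def logistic_gradient(w, b):
    """Gradient of the mean logistic loss with respect to (w1, w2, b) on the toy data."""
    g = [0.0, 0.0, 0.0]
    for (x1, x2), z in zip(X, Z):
        err = sigmoid(w[0] * x1 + w[1] * x2 + b) - z
        g[0] += err * x1
        g[1] += err * x2
        g[2] += err
    return [v / len(X) for v in g]

h_z = 1.0  # entropy of a balanced binary attribute, in bits

# Step 1: the representation is guarded against a direct log-linear adversary.
# The logistic loss is convex and its gradient vanishes at w = 0, b = 0, so the
# constant predictor p = 0.5 is optimal and the information gap is zero.
grad_at_zero = logistic_gradient((0.0, 0.0), 0.0)
info_direct = h_z - binary_xent_bits([0.0] * len(X), Z)

# Step 2: a four-class linear classifier whose decision regions form a Voronoi-like
# partition, with each point falling into its own region.
W = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

def predict(x):
    return max(range(len(W)), key=lambda k: W[k][0] * x[0] + W[k][1] * x[1])

y_hat = [predict(x) for x in X]

# Step 3: a log-linear adversary on the one-hot predicted class. With one weight
# per class, its logit is simply the weight of the predicted class.
v = [20.0, 20.0, -20.0, -20.0]
info_from_y = h_z - binary_xent_bits([v[y] for y in y_hat], Z)

print(f"information from x directly: {info_direct:.4f} bits")
print(f"information from y_hat:      {info_from_y:.4f} bits")
```

In this toy setting the direct estimate is 0 bits while the estimate from the predicted class is close to 1 bit, the full entropy of the attribute. With only two predicted classes the same trick is unavailable, which is consistent with the contrast the abstract draws between the binary and multiclass cases.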

Paper Structure (30 sections, 5 theorems, 26 equations, 3 figures)

Introduction
Information-Theoretic Guardedness
Preliminaries
$\mathcal{V}$-Information
Guardedness
Theoretical Analysis
Problem Formulation
A Binary Downstream Classifier
A Multiclass Downstream Classifier
Accuracy-Based Guardedness
Experimental Evaluation
Data.
Approximating log-linear guardedness.
Quantifying Empirical Guardedness
Binary $\mathrm{Z}$ and $\mathrm{Y}$
...and 15 more sections

Key Result

Theorem 3.2

Let $\mathcal{V}^{\delta}$ be the family of $\delta$-discretized log-linear models, and let $\boldsymbol{\mathrm{X}}$ be a representation-valued random variable. Define $\widehat{\mathrm{Y}}$ as in eq:binary-yhat, then ...

Figures (3)

Figure 1: Construction of a log-linear model that breaks log-linear guardedness.
Figure 2: Results for \ref{sec:binary-experiment}. Estimate of $\mathcal{V}$-information between the protected attribute and (1) the original representations (red); (2) the labels induced by the inner model within a composition of two log-linear models, trained to adversarially recover gender (blue); (3) labels for the downstream task (the predictions of profession classifiers; orange). The curve is the mean over different pairs of professions, and the shaded area represents one standard deviation. The $x$-axis presents results for different values of the threshold $\delta$. Recall that the thresholding is applied post hoc.
Figure 3: Results for \ref{sec:multi-experiment}. Estimate of $\mathcal{V}$-information between the protected attribute and $\widehat{\mathrm{Y}}_{\mathrm{a}}$ with various $\delta$.

Theorems & Definitions (17)

Definition 2.1: $\mathcal{V}$-Guardedness
Definition 2.2: Empirical $\mathcal{V}$-Guardedness
Definition 3.1: Discretized Log-Linear Models
Theorem 3.2
proof
Definition 3.3
Theorem 3.4
proof
Definition 4.1: Accuracy-based $\mathcal{V}$-guardedness
Definition 4.2: Accuracy-based Empirical $\mathcal{V}$-guardedness
...and 7 more
