A Geometric Notion of Causal Probing

Clément Guerner, Tianyu Liu, Anej Svete, Alexander Warstadt, Ryan Cotterell

TL;DR

The paper addresses how linguistic concepts are encoded in language model representations by introducing a geometric, intrinsic information-theoretic framework that identifies concept subspaces and disentangles them from spuriously correlated features. It formalizes erasure, encapsulation, containment, and stability within a counterfactual distribution, and builds a causal model with a latent concept variable to enable do-interventions for controlled generation. Empirically, it shows that a near one-dimensional subspace can encode verbal-number information and enable causal manipulation of generation in some models/languages, while findings for grammatical gender are more limited; CEBaB results suggest the approach can improve causal effect estimation over some baselines. The work advances understanding of how concepts are represented and manipulated in generation, offering a path toward principled, causal concept control in LM systems, albeit with assumptions and non-linearities that warrant further refinement.

Abstract

The linear subspace hypothesis (Bolukbasi et al., 2016) states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. Prior work has relied on auxiliary classification tasks to identify and evaluate candidate subspaces that might give support for this hypothesis. We instead give a set of intrinsic criteria which characterize an ideal linear concept subspace and enable us to identify the subspace using only the language model distribution. Our information-theoretic framework accounts for spuriously correlated features in the representation space (Kumar et al., 2022) by reconciling the statistical notion of concept information and the geometric notion of how concepts are encoded in the representation space. As a byproduct of this analysis, we hypothesize a causal process for how a language model might leverage concepts during generation. Empirically, we find that linear concept erasure is successful in erasing most concept information under our framework for verbal number as well as some complex aspect-level sentiment concepts from a restaurant review dataset. Our causal intervention for controlled generation shows that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.

Paper Structure (40 sections, 3 theorems, 25 equations, 3 figures, 6 tables)


Key Result

Theorem 4.1

Consider a joint distribution $p$ that factors as in Figure 2b, parameterized by an orthogonal projection matrix $\boldsymbol{P}$. Under the distribution we have that $\boldsymbol{P}$ is an $\varepsilon$-eraser, $\boldsymbol{I}_d - \boldsymbol{P}$ is an $\varepsilon$-encapsulator, $\boldsymbol{I}_d - \boldsymbol{P}$ is an $\varepsilon$-container and $\boldsymbol{P}$ is an $\varepsilon$-stab
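
The roles of $\boldsymbol{P}$ and $\boldsymbol{I}_d - \boldsymbol{P}$ in the theorem can be illustrated with a minimal NumPy sketch. It assumes a one-dimensional concept subspace spanned by a random direction `u`; the helper `rank_one_eraser`, the dimension `d`, and the vectors are hypothetical stand-ins, not the paper's learned projection. The sketch only checks the geometric facts: $\boldsymbol{P}$ is symmetric and idempotent, and any representation splits into an erased part $\boldsymbol{P}\boldsymbol{h}$ and a concept part $(\boldsymbol{I}_d - \boldsymbol{P})\boldsymbol{h}$ that are orthogonal to each other.

```python
import numpy as np

def rank_one_eraser(u):
    """Return P = I - u u^T / ||u||^2, the orthogonal projector that removes direction u."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    return np.eye(u.shape[0]) - np.outer(u, u)

rng = np.random.default_rng(0)
d = 5                          # hypothetical representation dimension
u = rng.normal(size=d)         # hypothetical concept direction (assumption)
P = rank_one_eraser(u)
I = np.eye(d)

h = rng.normal(size=d)         # a stand-in contextual representation
h_perp = P @ h                 # component with the concept direction erased
h_par = (I - P) @ h            # component lying in the concept subspace

# The two components add back up to h and are orthogonal to each other.
reconstruction_error = np.linalg.norm(h - (h_perp + h_par))
overlap = float(h_perp @ h_par)
```

Whether such a projection actually satisfies the $\varepsilon$-criteria depends on the language model distribution, which this sketch does not model.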

Figures (3)

  • Figure 1: Example of erasure of a verbal-number subspace, when predicting the next word given The kids. The representation space is two-dimensional with the $y$-axis representing the correct subspace encoding the concept ${{\color{violet}\texttt{verbal-number}}}$, while the $x$-axis encodes the lemma. Word representations are denoted with $\mathbf{e}$ and contextual representations with ${\color{gray} \boldsymbol{h}}$. On the left, we have the original representation space, and on the right, we have the space resulting from erasing information in our concept subspace, i.e., setting the $y$-coordinates of all vectors in the space to 0.
  • Figure 2: Causal graphical models that demonstrate how a concept may have a causal effect on word generation. Circles represent random variables and diamonds represent deterministic variables. ${\color{teal} \boldsymbol{X}}_{<t}, {\color{violet} C}, {\color{teal} X}$ represent the random variables for the textual context, the underlying concept, and the next word, respectively. ${\color{gray} \boldsymbol{H}}, {\color{gray} \boldsymbol{H}}_{\parallel}, {\color{gray} \boldsymbol{H}}_{\bot}$ are the representation at step $t$, its concept-related component, and its component whose concept-related information is erased by orthogonal projection matrix $\boldsymbol{P}$. Figure 2a shows the traditional autoregressive causal structure for generation. Figure 2b is our proposed causal structure for generation with a ${\color{violet} \mathcal{C}}$-valued latent variable ${\color{violet} C}$, with the backdoor path from ${\color{violet} C}$ to ${\color{gray} \boldsymbol{H}}$ shown in blue. Figure 2c is the causal structure induced by a do-intervention on ${\color{violet} C}$. Finally, Figure 2d is the causal structure implied by Yang and Klein's (2021) concept-controlled generation approach.
  • Figure 3: Controlled generation experiment. Reported values are computed on (context, fact, foil) samples from the test split of our curated datasets of natural text used to train LEACE. Orig. Acc. refers to the accuracy with which the model chooses fact over foil using original representations. Erased Acc. is the accuracy after erasure, using our counterfactual $q_{u}({\color{teal} x} | {\color{gray} \boldsymbol{h}}_\bot)$ distribution. Do Acc. measures, for example for Do(C=sg) (see the do-intervention equation), the rate at which the intervention induces the model to assign higher probability to the sg element of the (fact, foil) pair over its pl counterpart, reported in aggregate over sg and pl contexts in the test set.
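
The two-dimensional picture in Figure 1 can be reproduced as a toy computation. The sketch below is an illustration under stated assumptions, not the paper's setup: the four-word vocabulary, the embedding coordinates, the context vector for "The kids", and the softmax over inner products are all hypothetical choices. It shows that zeroing the $y$-coordinate of every vector leaves the lemma preference intact while making the singular and plural forms of each lemma equally likely.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 2-D word representations: x encodes the lemma, y encodes verbal number.
vocab = ["play", "plays", "eat", "eats"]
E = np.array([
    [ 1.0,  1.0],   # play  (plural form)
    [ 1.0, -1.0],   # plays (singular form)
    [-1.0,  1.0],   # eat   (plural form)
    [-1.0, -1.0],   # eats  (singular form)
])

# Hypothetical contextual representation for "The kids" (points toward plural).
h = np.array([0.5, 2.0])

# Erasure as described in Figure 1: set the y-coordinate of all vectors to 0.
P = np.diag([1.0, 0.0])

p_orig = softmax(E @ h)                 # next-word distribution before erasure
p_erased = softmax((E @ P) @ (P @ h))   # distribution after erasing the y-axis

# Probability mass on plural forms before and after erasure.
plural_mass_orig = p_orig[0] + p_orig[2]
plural_mass_erased = p_erased[0] + p_erased[2]
```

In this toy space the plural mass drops to exactly one half after erasure because the remaining $x$-coordinate carries no number information; in a real model, spuriously correlated features may leave residual information, which is the case the framework's criteria are meant to address.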

Theorems & Definitions (10)

  • Definition 3.1: Counterfactual Erasure
  • Definition 3.2: Counterfactual Encapsulation
  • Definition 3.3: Counterfactual Containment
  • Definition 3.4: Counterfactual Stability
  • Theorem 4.1
  • proof
  • Proposition A.0
  • proof
  • Theorem B.1
  • proof