Explaining Text Classifiers with Counterfactual Representations

Pirmin Lemberger, Antoine Saillenfest

TL;DR

This paper proposes a simple method for generating counterfactuals by intervening in the space of text representations rather than on the texts themselves, which avoids having to construct texts whose attribute values may not correspond to plausible real-world events. It argues that these interventions are minimally disruptive and theoretically sound, as they align with counterfactuals as defined in Pearl's causal inference framework.

Abstract

One well-motivated explanation method for classifiers leverages counterfactuals, which are hypothetical events identical to real observations in all aspects except for one feature. Constructing such counterfactuals poses specific challenges for texts, however, as some attribute values may not necessarily align with plausible real-world events. In this paper we propose a simple method for generating counterfactuals by intervening in the space of text representations, which bypasses this limitation. We argue that our interventions are minimally disruptive and that they are theoretically sound, as they align with counterfactuals as defined in Pearl's causal inference framework. To validate our method, we conducted experiments first on a synthetic dataset and then on a realistic dataset of counterfactuals. This allows for a direct comparison between classifier predictions based on ground-truth counterfactuals, obtained through explicit text interventions, and our counterfactuals, derived through interventions in the representation space. Finally, we study a real-world scenario where our counterfactuals can be leveraged both for explaining a classifier and for bias mitigation.

Paper Structure (37 sections, 14 equations, 4 figures, 13 tables)

This paper contains 37 sections, 14 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: The representation space when $Z$ takes $k=2$ values. Representations of texts for which $Z(s)=z_0$ are shown as $+$ and those for which $Z(s)=z_1$ as $-$; they form two clusters. The representation $x$ is associated with a text for which $Z=z_0$. Projecting $x$ with $\mathbf{P}$ onto $V^\perp$ yields a representation $x^\perp$ from which it is impossible to recover the value $z$ of the protected attribute $Z$ using a linear predictor. This information is contained in $x^\parallel$. Our CFRs $x_{Z\leftarrow z_0}$ and $x_{Z\leftarrow z_1}$ for $x$, corresponding to setting $Z=z_0$ or $Z=z_1$, are obtained by regressing $x^\parallel$ on $x^\perp$ using the observations for which $Z=z_0$ and $Z=z_1$ respectively (oblique dashed lines). The random variable $X_{Z\leftarrow z_1}(x)$ is the $Z=z_1$ non-deterministic Pearl counterfactual for $x$. Its expectation value corresponds to our CFR $x_{Z\leftarrow z_1}(x)$. The remaining notations are defined in equations (\ref{eq_X_Z}), (\ref{eq_reglin2}) and (\ref{eq_SCM}).
  • Figure 2: The DAG $G$ which corresponds to the SCM generating model of text documents.
  • Figure 3: Evolution, for aggressive training scenarios, of $\mathrm{ATE}_{\widehat{Y}}[\mathcal{S}_n]$ and $\widehat{\mathrm{ATE}}_{\widehat{Y}}[\mathcal{S}_n]$ (top), of the correlation coefficient $\rho$ (middle), and of the linear regression coefficient $\alpha$ (bottom) between $\mathrm{TE}_{\widehat{Y}}$ and $\widehat{\mathrm{TE}}_{\widehat{Y}}$ in $\mathcal{S}_n$, plotted against the fraction $|\mathcal{S}_n|/|\mathcal{S}|$ (in %) of included observations. The dotted vertical lines correspond to the maximal fractions of observations above which the correlation coefficient $\rho$ falls below $0.75$ and $0.5$, respectively.
  • Figure 4: $\mathrm{TPR}\text{-Gap}_{\text{female}, y}$ vs. the proportion of females for each occupation $y$ for each representation used during classifier training: original $X$, scrubbed representations $X^{\perp}$ and the augmented set $X$ + CFRs. Each set of vertically aligned points corresponds to an occupation $y$ (e.g. psychologist). Correlation and regression coefficients: $X$ 0.81, 0.50; $X^\perp$ 0.66, 0.32; $X$ + CFRs 0.69, 0.36.
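The construction described in the Figure 1 caption (split $x$ into $x^\perp$ and $x^\parallel$, then regress $x^\parallel$ on $x^\perp$ separately for each value of $Z$) can be illustrated with a minimal NumPy sketch for the $k=2$ case. This is not the authors' implementation. In particular, the captions do not say how $\mathbf{P}$ is built, so the sketch assumes that $V$ is spanned by the difference of the two class means. That is a simplified stand-in: it equalizes the class means of $x^\perp$ but does not by itself guarantee that $Z$ is linearly unrecoverable. The function names `fit_cfr` and `counterfactual` are hypothetical.

```python
import numpy as np


def fit_cfr(X, z):
    """Fit the pieces needed to build CFRs for a binary attribute.

    X: (n, d) array of text representations.
    z: (n,) array of attribute values in {0, 1}.
    Returns the projector onto V^perp and one affine regression per value of Z.
    """
    d = X.shape[1]
    # ASSUMPTION: V is the line spanned by the difference of class means.
    mu0 = X[z == 0].mean(axis=0)
    mu1 = X[z == 1].mean(axis=0)
    v = (mu1 - mu0) / np.linalg.norm(mu1 - mu0)
    P_par = np.outer(v, v)          # orthogonal projector onto V
    P_perp = np.eye(d) - P_par      # orthogonal projector onto V^perp

    models = {}
    for value in (0, 1):
        Xv = X[z == value]
        X_perp = Xv @ P_perp        # P_perp is symmetric
        X_par = Xv @ P_par
        # Affine least squares: x_par ~ [x_perp, 1] @ W, fitted on Z = value only.
        A = np.hstack([X_perp, np.ones((len(Xv), 1))])
        W, *_ = np.linalg.lstsq(A, X_par, rcond=None)
        models[value] = W
    return P_perp, v, models


def counterfactual(x, target, P_perp, models):
    """Return the CFR of x under the intervention Z <- target."""
    x_perp = P_perp @ x
    # Predicted parallel component for the target value of Z.
    x_par_hat = np.append(x_perp, 1.0) @ models[target]
    return x_perp + x_par_hat
```

By construction the intervention leaves $x^\perp$ untouched and only replaces $x^\parallel$ with its regression estimate for the chosen value of $Z$, which is the sense in which the caption's oblique dashed lines define the CFRs.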