Causal ATE Mitigates Unintended Bias in Controlled Text Generation

Rahul Madhavan, Kahini Wadhawan

TL;DR

The paper tackles unintended bias in controlled text generation by modeling attribute control through the Causal Average Treatment Effect (Causal ATE). It defines word- and sentence-level ATE metrics via counterfactual word substitutions and proves that the ATE of a spurious correlate is at most $0.25$, which gives a principled basis for robustness. Empirically, it validates the approach on two toxicity datasets, where ATE scores assign lower toxicity to protected groups and improve false-positive behavior; the code is released for reproducibility. The work suggests that causal, perturbation-based ATE methods generalize to attributes beyond toxicity, offering a scalable, transparent mechanism for safer, more robust language-model detoxification and attribute control.

Abstract

We study attribute control in language models through the method of Causal Average Treatment Effect (Causal ATE). Existing methods for the attribute control task in Language Models (LMs) check for the co-occurrence of words in a sentence with the attribute of interest, and control for them. However, spurious correlation of words with the attribute in the training dataset can cause models to hallucinate the presence of the attribute when presented with the spurious correlate during inference. We show that the simple perturbation-based method of Causal ATE removes this unintended effect. Specifically, we ground it in the problem of toxicity mitigation, where a significant challenge lies in the inadvertent bias that often emerges towards protected groups post-detoxification. We show that this unintended bias can be resolved by using the Causal ATE metric, and we rigorously prove our claim. We provide experimental validation of our claims and release our code (anonymously) here: https://github.com/causalate-mitigates-bias/causal-ate-mitigates-bias.
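
The contrast between a co-occurrence score and a perturbation-based ATE can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon classifier, the four-sentence dataset, the fixed replacement list (standing in for whatever replacement-word distribution the paper uses), and all function names are hypothetical choices made for illustration. Because the classifier here is a hand-written rule rather than a trained model, the spurious word's ATE comes out exactly zero rather than merely bounded.

```python
from statistics import mean

# Hypothetical toy classifier: a sentence is toxic (1.0) if it contains a lexicon word.
TOXIC_WORDS = {"idiot", "stupid"}


def toy_classifier(tokens):
    return 1.0 if any(t in TOXIC_WORDS for t in tokens) else 0.0


def treatment_effect(tokens, i, replacement, f):
    """Change in classifier score when the word at position i is replaced."""
    perturbed = tokens[:i] + [replacement] + tokens[i + 1:]
    return f(tokens) - f(perturbed)


def ate(word, dataset, replacements, f):
    """Average treatment effect of `word` over every occurrence in the dataset."""
    effects = []
    for tokens in dataset:
        for i, t in enumerate(tokens):
            if t == word:
                effects.extend(
                    treatment_effect(tokens, i, r, f)
                    for r in replacements
                    if r != word
                )
    return mean(effects) if effects else 0.0


def cooccurrence_score(word, dataset, f):
    """Mean classifier score of the sentences that contain `word`."""
    scores = [f(tokens) for tokens in dataset if word in tokens]
    return mean(scores) if scores else 0.0


# Assumed toy data: "women" co-occurs with toxic sentences but does not cause toxicity.
DATASET = [
    "those women are stupid".split(),
    "women are idiot drivers".split(),
    "women lead the team".split(),
    "you are stupid".split(),
]
REPLACEMENTS = ["people", "they", "nice"]  # stand-in for sampled replacement words

for w in ("women", "stupid"):
    print(
        w,
        "co-occurrence:", round(cooccurrence_score(w, DATASET, toy_classifier), 3),
        "ATE:", round(ate(w, DATASET, REPLACEMENTS, toy_classifier), 3),
    )
```

In this toy setting the co-occurrence score penalizes "women" because two of the three sentences containing it are toxic, while replacing the word never changes the classifier output, so its ATE is zero. Replacing "stupid" always flips the output, so its ATE is one.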

Paper Structure (24 sections, 3 theorems, 8 equations, 6 figures, 3 tables)

This paper contains 24 sections, 3 theorems, 8 equations, 6 figures, 3 tables.

Key Result

Lemma 1

Consider a sentence $s = \{w_1,\dots,w_k\}$. We make two simple claims. This lemma is straightforward to prove from the definition of the ATE of a sentence.

Figures (6)

  • Figure 1: We plot the ATE score vs a regression based classifier for toxicity across two datasets. ATE Scores show a lower toxicity for protected groups.
  • Figure 2: An Illustration of the Causal Graph used to compute the attribute score of a sentence recursively.
  • Figure 3: Graph of ATE score of a given word $w_i$ with $\widehat{a}(w_i)$ given two cases
  • Figure 4: Illustration of how perturbation of the words in a sentence may be used to identify the most important words with respect to an attribute.
  • Figure 5: Graph of ATE score of a given word $w_i$ with $\widehat{a}(w_i)$ given two cases
  • ...and 1 more figure

Theorems & Definitions (9)

  • Definition 1: Attribute model $\widehat{a} (w_i)$ for any word $w_i \in W$
  • Definition 2: Attribute model $\widehat{A}(s)$ for a sentence $s \in W^k$
  • Definition 3: Treatment Effect (TE) of a word in a sentence given replacement word
  • Definition 4: $\texttt{ATE}$ of word $w_i$ given dataset $\mathcal{D}$ and an attribute classifier $f(\cdot)$
  • Lemma 1
  • Theorem 1
  • proof
  • Theorem
  • proof