Mitigating Text Toxicity with Counterfactual Generation

Milan Bhan, Jean-Noel Vittaut, Nina Achache, Victor Legrand, Nicolas Chesneau, Annabelle Blangero, Juliette Murris, Marie-Jeanne Lesot

TL;DR

The paper tackles automatic toxicity mitigation: rewriting toxic text while preserving its original non-toxic meaning. It uses two XAI techniques, local feature importance and counterfactual generation, to identify toxic tokens and produce detoxified counterfactuals via a target-then-replace approach, implemented as CF-Detox$_{\text{tigtec}}$. Across three datasets, automatic and human evaluations show competitive toxicity reduction and notably better content preservation than three competing methods. The authors also discuss ethical risks and advocate human-in-the-loop oversight. The work further shows how Counterfactual Feature Importance can refine detoxifications to increase sparsity and similarity to the original text, bridging explainable AI and practical toxicity processing.

Abstract

Toxicity mitigation consists in rephrasing text in order to remove offensive or harmful meaning. Neural natural language processing (NLP) models have been widely used to target and mitigate textual toxicity. However, existing methods fail to detoxify text while preserving the initial non-toxic meaning at the same time. In this work, we propose to apply counterfactual generation methods from the eXplainable AI (XAI) field to target and mitigate textual toxicity. In particular, we perform text detoxification by applying local feature importance and counterfactual generation methods to a toxicity classifier distinguishing between toxic and non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach to three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators can mitigate toxicity accurately while better preserving the meaning of the initial text as compared to classical detoxification methods. Finally, we take a step back from using automated detoxification tools, and discuss how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools. This work is the first to bridge the gap between counterfactual generation and text detoxification and paves the way towards more practical application of XAI methods.
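The target-then-replace idea described above can be illustrated with a minimal sketch. This is not the authors' CF-Detox$_{\text{tigtec}}$ implementation: the lexicon-based `toxicity` scorer, the `REPLACEMENTS` table, and the leave-one-out importance are all hypothetical stand-ins, chosen so the example runs without any model. In practice the scorer would be a trained toxicity classifier and the candidates would come from a masked language model.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon standing in for a trained toxicity classifier.
TOXIC_WEIGHTS: Dict[str, float] = {"stupid": 0.6, "idiot": 0.7, "garbage": 0.4}

# Hypothetical replacement table standing in for a masked language model.
REPLACEMENTS: Dict[str, List[str]] = {
    "stupid": ["questionable", "odd"],
    "idiot": ["person"],
    "garbage": ["weak"],
}

Scorer = Callable[[List[str]], float]


def toxicity(tokens: List[str]) -> float:
    """Toy toxicity score in [0, 1]: sum of lexicon weights, capped at 1."""
    return min(1.0, sum(TOXIC_WEIGHTS.get(t.lower(), 0.0) for t in tokens))


def token_importance(tokens: List[str], score: Scorer) -> List[float]:
    """Leave-one-out importance: drop in score when a token is removed.

    An assumed, simple form of local feature importance for illustration.
    """
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def detoxify(text: str, score: Scorer = toxicity, threshold: float = 0.5) -> str:
    """Edit the most important tokens until the score falls below threshold."""
    tokens = text.split()
    # Target: visit tokens from most to least important.
    importance = token_importance(tokens, score)
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    for i in order:
        if score(tokens) < threshold:
            break  # prediction flipped; stop editing to keep changes sparse
        candidates = REPLACEMENTS.get(tokens[i].lower(), [])
        if not candidates:
            continue
        # Replace: keep the candidate that lowers the toxicity score the most.
        tokens[i] = min(
            candidates,
            key=lambda c: score(tokens[:i] + [c] + tokens[i + 1:]),
        )
    return " ".join(tokens)


if __name__ == "__main__":
    print(detoxify("that idiot wrote a stupid plan"))
```

Stopping as soon as the score crosses the threshold is what keeps the number of edited tokens small, which is the property the abstract refers to as preserving the initial meaning.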

Paper Structure (40 sections, 1 equation, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Example of one text detoxification through counterfactual generation with our proposed CF-Detox$_{\text{tigtec}}$ compared to MaRCo (Hallinan et al., 2023) and CondBERT (Dale et al., 2021). Text changes to mitigate toxicity are highlighted in blue. Explicitly toxic words have been censored with *.
  • Figure 2: Illustrative example of a toxic text and a CF detoxification process with our proposed target-then-replace-then-refine approach. (1) Toxic content (in red) is targeted and (2) then modified (in blue). To be more content preserving, the counterfactual example is finally refined (3) by restarting the counterfactual generation process, guided by counterfactual feature importance. Explicitly toxic words have been censored with *. The darker the shade of color (red for LFI, blue for CFI), the higher the importance.
  • Figure 3: Toxicity mitigation and counterfactual generation comparison by method category. Toxicity mitigation methods and counterfactual generators can be categorized as steered text generation and target-then-replace approaches. Neural NLP models used to generate text or replace tokens are similar or of the same nature.
  • Figure 4: Toxicity comparison on three test sets with a human-grounded experimental ranking evaluation. Competitor rank distributions are compared to CF-Detox$_{\text{tigtec}}$ using a one-tailed paired t-test at a 5% risk threshold. The sign "-" indicates that the rank is lower on average than that of CF-Detox$_{\text{tigtec}}$, whereas "=" and "+" indicate that the rank is on average similar and greater, respectively.
  • Figure 5: Evolution of the gain in sparsity and similarity induced by CFI on Dynahate, according to the number of tokens initially modified, per model.
  • ...and 1 more figure