DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification

Cho Hyeonsu, Dooyoung Kim, Youngjoong Ko

TL;DR

This work tackles fine-grained toxicity control in pretrained language models by extending linear-probe detoxification to multiple category-specific toxicity directions. It introduces Domain-Adaptive Toxicity Probe Vector Intervention (DAPI), which trains multiple toxicity probe vectors, dynamically selects the most relevant one at each generation step via cosine similarity to the hidden state, and applies a per-token, dynamically scaled subtraction from the last-layer representation. A Cosine Similarity Regularization Loss encourages distinctiveness among category probes, enabling effective detoxification across both majority and minority toxicity categories. Experiments on REALTOXICITYPROMPTS and related datasets show up to 78.52% toxicity reduction with minimal fluency loss, outperforming single-probe and decoding-time baselines, and ablations confirm the value of per-category probing and dynamic scaling for preserving text quality.
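The summary does not give the exact form of the Cosine Similarity Regularization Loss. As a minimal sketch, assume it penalizes the mean absolute pairwise cosine similarity among the category probe vectors, so that the penalty is zero when all probes are mutually orthogonal. The function names `cosine` and `cosine_similarity_penalty` are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_penalty(probes):
    """Mean absolute pairwise cosine similarity among probe vectors.

    ASSUMPTION: this particular form (absolute value, mean over pairs) is an
    illustrative choice; the paper's actual regularizer may differ.
    """
    pairs = list(combinations(probes, 2))
    return sum(abs(cosine(u, v)) for u, v in pairs) / len(pairs)


# Toy probe vectors for three hypothetical toxicity categories.
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
overlapping = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]

# Overlapping probes incur a larger penalty than mutually orthogonal ones.
print(cosine_similarity_penalty(distinct) < cosine_similarity_penalty(overlapping))
```

In training, a term like this would presumably be added to the probe classification loss with a weighting coefficient, pushing the category directions apart.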

Abstract

There have been attempts to use linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be divided into fine-grained subcategories, which makes it difficult to remove certain types of toxicity with a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model's hidden state. Our method successfully mitigates toxicity in categories that the single-probe-vector approach failed to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
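The selection and subtraction steps can be sketched for a single generation step as follows. This is an illustration under stated assumptions, not the authors' implementation: the scaling rule (a base strength `alpha` multiplied by the cosine similarity, clamped at zero) is hypothetical, since the exact dynamic scaling factor is not given here, and the names `select_probe`, `detoxify_step`, `x_avg`, and `x_t` are chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_probe(x_avg, probes):
    """Return (index, similarity) of the probe most similar to x_avg."""
    sims = [cosine(x_avg, w) for w in probes]
    best = max(range(len(probes)), key=lambda i: sims[i])
    return best, sims[best]


def detoxify_step(x_t, x_avg, probes, alpha=1.0):
    """Subtract the selected, scaled probe direction from the hidden state x_t.

    ASSUMPTION: the scale alpha * max(similarity, 0) is a hypothetical
    stand-in for the paper's dynamic scaling factor.
    """
    idx, sim = select_probe(x_avg, probes)
    scale = alpha * max(sim, 0.0)
    w = probes[idx]
    norm_w = math.sqrt(sum(a * a for a in w))
    steered = [h - scale * a / norm_w for h, a in zip(x_t, w)]
    return idx, steered


# Toy example with two category probes in a 2-dimensional hidden space.
probes = [[1.0, 0.0], [0.0, 1.0]]
x_avg = [0.1, 0.9]   # context representation used for probe selection
x_t = [0.5, 2.0]     # hidden state that gets steered

idx, steered = detoxify_step(x_t, x_avg, probes)
print(idx, steered)
```

Because the scale shrinks toward zero when the context is not aligned with any toxicity direction, a rule of this kind leaves benign hidden states largely untouched, which is consistent with the small fluency drop reported above.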

Paper Structure

This paper contains 30 sections, 4 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Continuations of the same non-toxic prompt from GPT-2 Large, generated with a single toxicity probe vector (red) and with Domain-Adaptive Toxicity Probe Vector Intervention (DAPI) (blue). Without DAPI, the model produced toxic content despite the non-toxic prompt. In contrast, our method successfully prevented toxicity while maintaining fluency. WARNING: THESE EXAMPLES ARE HIGHLY OFFENSIVE.
  • Figure 2: An overview of DAPI. It consists of three steps. Step 1, Extracting Probe Vectors: a linear classifier is trained on the last hidden state of the model to obtain probe vectors that represent a distinct direction for each toxicity category. Step 2, Probe Vector Selection: at every time step $t$ during inference, the probe vector among the acquired probe vectors $W$ that is most similar to the averaged hidden states before the FFN, $X_{avg}$, is selected. Step 3, Detoxification Using the Probe Vector: the selected probe vector is scaled by a dynamic scaling factor and subtracted from the last hidden state $\tilde{X_t}$ in the model’s last layer.
  • Figure 3: Results of human evaluation. 'Less toxic' means the continuation is less toxic; 'More natural' means the continuation is more natural and contextually coherent.
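Step 1 of Figure 2, training a linear classifier on hidden states to obtain a probe vector, can be illustrated with a small logistic-regression probe for a single category. This is a sketch under assumptions: the function name `train_probe`, the plain gradient-descent training loop, and the toy 2-dimensional "hidden states" are all hypothetical, and the paper's classifier and training setup may differ.

```python
import math


def train_probe(hidden_states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe; the weight vector is the probe direction.

    hidden_states: list of feature vectors (lists of floats)
    labels: 1 for toxic examples of this category, 0 otherwise
    ASSUMPTION: full-batch gradient descent is used purely for illustration.
    """
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(hidden_states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "toxic"
            err = p - y                      # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: the first coordinate separates toxic (1) from non-toxic (0) examples.
X = [[2.0, 0.0], [1.5, 0.2], [-1.0, 0.1], [-2.0, -0.1]]
y = [1, 1, 0, 0]

w, b = train_probe(X, y)
print(w)  # the learned direction points mainly along the first coordinate
```

Repeating this once per toxicity category would yield the set of probe vectors $W$ from which Step 2 selects.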