DeStein: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

Yu Li; Han Jiang; Chuanyang Gong; Zhihua Wei

TL;DR

DeStein detoxifies large language models at low resource cost by operating directly in activation space, using universal steering pairs that the model generates itself. It computes detoxification vectors $z$ from activation differences over these pairs and fuses them into the original representations at the head level during inference, with no tuning required: $\hat{h}(x) = h(x) + \alpha_{contr} z$ and, with probing weights, $\hat{h}(x) = h(x) + \alpha_{prob} \alpha_{contr} z$. Empirical results on RealToxicityPrompts with GPT2-large and multiple LLMs show state-of-the-art detoxification while preserving fluency and diversity, and the method scales with minimal inference-time overhead. The approach offers interpretability via probing analyses and activation-space visualizations and is open-sourced for practical deployment. Its limitations are that it assumes toxicity is linearly represented in activation space and that it depends on generating parallel data.
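The two fusion rules above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the array shapes, the choice of non-toxic minus toxic as the difference direction, the averaging over pairs, the toy data, and the names `detox_vectors` and `fuse` are all assumptions made for illustration, and the values of $\alpha_{contr}$ and $\alpha_{prob}$ are arbitrary.

```python
import numpy as np


def detox_vectors(nontoxic_acts, toxic_acts):
    """Average per-head activation difference over steering pairs.

    Both inputs are assumed to have shape (num_pairs, num_heads, head_dim).
    Assumption: z points from the toxic toward the non-toxic activations.
    """
    return (nontoxic_acts - toxic_acts).mean(axis=0)


def fuse(h, z, alpha_contr, alpha_prob=None):
    """Apply h_hat = h + alpha_contr * z, optionally scaled per head.

    h and z have shape (num_heads, head_dim). alpha_prob, if given, is a
    vector of per-head weights (assumed here to come from probe accuracy).
    """
    if alpha_prob is None:
        return h + alpha_contr * z
    # Broadcast the per-head weight across the head dimension.
    return h + alpha_prob[:, None] * alpha_contr * z


# Toy data: random "toxic" activations and non-toxic ones shifted by +1.
rng = np.random.default_rng(0)
num_pairs, num_heads, head_dim = 8, 4, 16
toxic = rng.normal(size=(num_pairs, num_heads, head_dim))
nontoxic = toxic + 1.0

z = detox_vectors(nontoxic, toxic)
h = rng.normal(size=(num_heads, head_dim))
alpha_prob = np.array([0.9, 0.6, 0.5, 0.8])  # hypothetical per-head weights

h_plain = fuse(h, z, alpha_contr=0.3)
h_probe = fuse(h, z, alpha_contr=0.3, alpha_prob=alpha_prob)
```

In this sketch, a head with a small probing weight receives a proportionally smaller shift, which is one way to read the role of $\alpha_{prob}$ in the second formula.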

Abstract

Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving finetuning or auxiliary models usually require extensive computational resources, hindering their practicality in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs. Specifically, we derive detoxification vectors from self-induced, universal steering pairs through arithmetic operations in activation spaces. During inference, detoxification is achieved by fusing the detoxification vectors with the original representations in a head-wise manner. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on various metrics, while also maintaining satisfactory generation quality and diversity. We further validate the practicality and scalability of DeStein with a series of white-box LLMs. The method is open-sourced at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may contain highly offensive or disturbing text.

Paper Structure (25 sections, 6 equations, 3 figures, 14 tables)

Introduction
Related works
Methods
Formalization and Preliminaries
Universal steering pairs generation
Head-wise activation fusion with probing techniques
Experiments
Experimental settings
Evaluation results
Ablation study
Further analysis
Trade-off between detoxification and task performance in LLMs
Analysis on the influence of detoxification strength
Analysis on interpretability in activation spaces
Conclusions
...and 10 more sections

Figures (3)

Figure 1: An illustration of DeStein. Detoxification vectors are synthesized from self-induced steering pairs in activation spaces. During inference, these vectors are then integrated with head-wise probes to perform detoxification.
Figure 2: Trade-off between detoxification strength and PPL on GPT2-large.
Figure 3: (a) Linear probe accuracy of GPT2-large's heads on the validation set, with deep red showing higher accuracy. (b) and (c) show toxic and non-toxic statement representations in the 6th head of the 23rd layer and the 7th head of the 12th layer in GPT2-large.
