DeStein: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
Yu Li, Han Jiang, Chuanyang Gong, Zhihua Wei
TL;DR
DeStein detoxifies large language models at low resource cost by operating in activation space with self-generated universal steering pairs. It computes a detoxification vector $z$ from activation differences over these pairs and fuses it into the model's representations at the head level during inference, without any tuning: $\hat{h}(x) = h(x) + \alpha_{contr} z$, or, with probing weights, $\hat{h}(x) = h(x) + \alpha_{prob} \alpha_{contr} z$. On RealToxicityPrompts, with GPT2-large and multiple LLMs, it achieves state-of-the-art detoxification while preserving fluency and diversity, and it scales with minimal inference-time overhead. Probing analyses and activation-space visualizations provide interpretability, and the code is open-sourced for practical deployment. The method relies on a linear-representation assumption and on generating parallel data.
Abstract
Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving finetuning or auxiliary models usually require extensive computational resources, hindering their practicality in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs. Specifically, we derive detoxification vectors from self-induced, universal steering pairs through arithmetic operations in activation spaces. During inference, detoxification is achieved by fusing the detoxification vectors with the original representations in a head-wise manner. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on various metrics, while also maintaining satisfactory generation quality and diversity. We further validate the practicality and scalability of DeStein with a series of white-box LLMs. The method is open-sourced at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may contain highly offensive or disturbing text.
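The two steps named above (deriving detoxification vectors by arithmetic in activation space, then fusing them head-wise) can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the function names `detox_vectors` and `fuse`, the toy activations, the non-toxic-minus-toxic sign convention, and the plain averaging over pairs are all assumptions made for illustration. Only the update $\hat{h}(x) = h(x) + \alpha_{prob} \alpha_{contr} z$ is taken from the text.

```python
from typing import List, Optional

Vector = List[float]
HeadActs = List[Vector]  # one activation vector per attention head


def detox_vectors(nontoxic: List[HeadActs], toxic: List[HeadActs]) -> HeadActs:
    """Average the per-head activation difference over all steering pairs.

    Assumption: the difference is taken as non-toxic minus toxic, so that
    adding z moves activations toward the non-toxic side.
    """
    n_pairs = len(nontoxic)
    n_heads = len(nontoxic[0])
    dim = len(nontoxic[0][0])
    z = [[0.0] * dim for _ in range(n_heads)]
    for good, bad in zip(nontoxic, toxic):
        for h in range(n_heads):
            for d in range(dim):
                z[h][d] += (good[h][d] - bad[h][d]) / n_pairs
    return z


def fuse(
    h_x: HeadActs,
    z: HeadActs,
    alpha_contr: float,
    alpha_prob: Optional[List[float]] = None,
) -> HeadActs:
    """Head-wise fusion: h_hat = h + alpha_prob[head] * alpha_contr * z.

    If alpha_prob is None, every head gets weight 1.0, which reduces to
    h_hat = h + alpha_contr * z.
    """
    weights = alpha_prob if alpha_prob is not None else [1.0] * len(h_x)
    return [
        [h_d + w * alpha_contr * z_d for h_d, z_d in zip(h_vec, z_vec)]
        for h_vec, z_vec, w in zip(h_x, z, weights)
    ]


# Toy data: 2 steering pairs, 2 heads, 2 dimensions per head (hypothetical values).
nontoxic = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [0.0, 4.0]]]
toxic = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]]

z = detox_vectors(nontoxic, toxic)
h = [[1.0, 1.0], [1.0, 1.0]]
steered_all = fuse(h, z, alpha_contr=0.5)
steered_probed = fuse(h, z, alpha_contr=0.5, alpha_prob=[1.0, 0.0])
```

With a probing weight of 0.0 on the second head, that head's activation passes through unchanged, which is how per-head weights can restrict steering to selected heads in this sketch.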
