Parameter-Efficient Detoxification with Contrastive Decoding

Tong Niu; Caiming Xiong; Semih Yavuz; Yingbo Zhou

TL;DR

DETOXIGEN tackles unsafe generation in large language models by coupling a frozen generator with a detoxifier that is trained on toxic data via prompt tuning. At each decoding step it applies a contrastive strategy that adjusts the output distribution as $P(x_t\mid x_{<t}) = P_{GEN}(x_t\mid x_{<t}) + \alpha\Delta P$, where $\Delta P = P_{GEN}(x_t\mid x_{<t}) - P_{DE}(x_t\mid x_{<t})$; the result is clipped and renormalized so that it remains a valid probability distribution, and top-$p$ sampling restricts the candidate tokens to the set $V^{(p)}$. On RealToxicityPrompts, DETOXIGEN significantly reduces toxicity while preserving generation quality. Pairing a generator with a detoxifier built on the same backbone (the diagonal of the generator-detoxifier pairing grid) yields the strongest performance, and larger model-size gaps between the two reduce efficacy. The approach is parameter-efficient, requiring only soft prompts (virtual tokens) and no full fine-tuning, and it transfers across GPT-2 and Llama-2 backbones, which suggests it is practical for safe deployment of large language models.
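To make the decoding rule above concrete, here is a minimal sketch of a single contrastive step in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `top_p_set` and `contrastive_step`, the choice to build $V^{(p)}$ from the generator's distribution, clipping at zero, and the fallback when every score is clipped are all assumptions made for this example.

```python
def top_p_set(probs, p):
    """Return indices of the smallest set of tokens whose cumulative
    probability reaches p (the nucleus, written V^(p) in the text)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept


def contrastive_step(p_gen, p_de, alpha=1.0, top_p=0.9):
    """Adjust the generator distribution away from the detoxifier's.

    Assumption: the nucleus is taken from the generator's distribution,
    and negative scores are clipped to zero before renormalizing.
    """
    keep = top_p_set(p_gen, top_p)
    scores = {}
    for i in keep:
        delta = p_gen[i] - p_de[i]                      # Delta P = P_GEN - P_DE
        scores[i] = max(p_gen[i] + alpha * delta, 0.0)  # clip at zero
    z = sum(scores.values())
    if z == 0.0:
        # Hypothetical fallback: reuse the generator's nucleus probabilities.
        z = sum(p_gen[i] for i in keep)
        return {i: p_gen[i] / z for i in keep}
    return {i: s / z for i, s in scores.items()}        # renormalize


# Toy 4-token vocabulary: the detoxifier strongly prefers token 1.
p_gen = [0.5, 0.3, 0.15, 0.05]
p_de = [0.1, 0.7, 0.15, 0.05]
adjusted = contrastive_step(p_gen, p_de, alpha=1.0, top_p=0.9)
```

In this toy case the token that the detoxifier rates far more likely than the generator does (token 1) ends up with zero probability, while the remaining mass shifts toward token 0. The next token would then be sampled from `adjusted`.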

Abstract

The field of natural language generation has witnessed significant advancements in recent years, including the development of controllable text generation techniques. However, controlling the attributes of the generated text remains a challenge, especially when aiming to avoid undesirable behavior such as toxicity. In this work, we introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles. DETOXIGEN is an ensemble of a pre-trained language model (generator) and a detoxifier. The detoxifier is trained intentionally on the toxic data representative of the undesirable attribute, encouraging it to generate text in that style exclusively. During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step. This approach directly informs the generator to avoid generating tokens that the detoxifier considers highly likely. We evaluate DETOXIGEN on the commonly used REALTOXICITYPROMPTS benchmark (Gehman et al., 2020) with various language models as generators. We find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality. Moreover, the detoxifier is obtained by soft prompt-tuning using the same backbone language model as the generator. Hence, DETOXIGEN requires only a tiny amount of extra weights from the virtual tokens of the detoxifier to be loaded into GPU memory while decoding, making it a promising lightweight, practical, and parameter-efficient detoxification strategy.
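The parameter-efficiency claim can be illustrated with a small back-of-the-envelope calculation. The sketch below assumes that the soft prompt adds one trainable vector per virtual token at the input embedding layer only; the number of virtual tokens (100) is a hypothetical value chosen for illustration, and the backbone figures (hidden size 768, roughly 124M parameters) are those of the smallest GPT-2 model, not necessarily a configuration used in the paper.

```python
def soft_prompt_overhead(num_virtual_tokens, hidden_size, backbone_params):
    """Return (extra trainable parameters, fraction of backbone size).

    Assumption: each virtual token contributes exactly one embedding
    vector of length hidden_size and nothing else is trained.
    """
    extra = num_virtual_tokens * hidden_size
    return extra, extra / backbone_params


# Hypothetical setting: 100 virtual tokens on a GPT-2-small-sized backbone.
extra, fraction = soft_prompt_overhead(
    num_virtual_tokens=100, hidden_size=768, backbone_params=124_000_000
)
print(f"{extra} extra parameters ({fraction:.4%} of the backbone)")
```

Under these assumptions the detoxifier adds well under a tenth of a percent of the backbone's parameters, which is what allows the generator and the detoxifier to share one copy of the frozen weights in GPU memory.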

Paper Structure (31 sections, 6 equations, 1 figure, 6 tables)

Introduction
Model
Task Formulation
Model Components
Generator
Detoxifier
Sampling
Parameter-Efficient Training of Detoxifier
Experimental Setup
Backbone Models
Training of Detoxifier
Hyperparameter Tuning
Evaluation Data
Metrics
Toxicity
...and 16 more sections

Figures (1)

Figure 1: Illustration of the DETOXIGEN pipeline avoiding a gender-biased next token. A prompt is fed into both the generator and the detoxifier, which share the same frozen weights from the backbone language model. The detoxifier additionally contains virtual tokens whose embeddings are trainable; these virtual tokens steer the detoxifier toward generating only toxic continuations. Each model produces its own probability distribution for the next token, and DETOXIGEN combines the two distributions to perform the detoxification.
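The shared-backbone arrangement in Figure 1 can be sketched as follows. This is a toy illustration only: the embedding table, the two-dimensional vectors, and the helper names `embed`, `generator_inputs`, and `detoxifier_inputs` are hypothetical, and the sketch assumes that the virtual tokens are simply prepended to the prompt embeddings, as in standard prompt tuning.

```python
def embed(token_ids, embedding_table):
    """Look up the frozen backbone embedding for each prompt token."""
    return [embedding_table[t] for t in token_ids]


def generator_inputs(prompt_ids, embedding_table):
    """The generator sees only the prompt embeddings."""
    return embed(prompt_ids, embedding_table)


def detoxifier_inputs(prompt_ids, embedding_table, virtual_tokens):
    """The detoxifier sees trainable virtual tokens followed by the same
    prompt embeddings (assumption: virtual tokens are prepended)."""
    return list(virtual_tokens) + embed(prompt_ids, embedding_table)


# Toy frozen embedding table shared by both models (hypothetical values).
embedding_table = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}
# Trainable soft prompt: the only weights that differ between the two models.
virtual_tokens = [[0.0, 0.0], [0.0, 0.0]]

prompt_ids = [2, 0, 1]
gen_in = generator_inputs(prompt_ids, embedding_table)
det_in = detoxifier_inputs(prompt_ids, embedding_table, virtual_tokens)
```

Both input sequences would then pass through the same frozen backbone, so the only difference between the two next-token distributions comes from the virtual tokens at the front of `det_in`.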
