
Large Language Models can be Strong Self-Detoxifiers

Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati, Luca Daniel

TL;DR

Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for reducing the toxicity of LLM output. It markedly enhances the quality of the generated sentences relative to the original models and attains performance comparable to state-of-the-art detoxification techniques.

Abstract

Reducing the likelihood of generating harmful and toxic output is an essential task when aligning large language models (LLMs). Existing methods mainly rely on training an external reward model (i.e., another language model) or fine-tuning the LLM on self-generated data to influence the outcome. In this paper, we show that LLMs are capable of self-detoxification without an additional reward model or re-training. We propose Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn, in analytical form, linear subspaces that characterize toxic vs. non-toxic output. When auto-completing a response token by token, SASA dynamically tracks the margin of the current output and adjusts the autoregressive sampling strategy to steer the generation away from the toxic subspace. Evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains performance comparable to state-of-the-art detoxification techniques, significantly reducing the toxicity level using only the LLM's internal representations.
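
The decoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear classifier (`w`, `b`), the strength parameter `beta`, the exponential form of the reweighting, and all toy numbers are assumptions made for illustration. The sketch scores each candidate continuation by its signed margin under a linear classifier on a context embedding, then tilts the base sampling probabilities toward candidates on the non-toxic side.

```python
import math

def margin(embedding, w, b):
    """Signed score of a context embedding under a linear classifier.
    Assumption: positive values lie on the non-toxic side."""
    return sum(wi * xi for wi, xi in zip(w, embedding)) + b

def reweight(probs, margins, beta):
    """Return probs[i] * exp(beta * margins[i]), renormalized to sum to 1.
    Computed in log space for numerical stability."""
    if not any(p > 0 for p in probs):
        raise ValueError("at least one candidate needs non-zero probability")
    scores = [math.log(p) + beta * m if p > 0 else float("-inf")
              for p, m in zip(probs, margins)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [x / total for x in weights]

# Toy setup (all values hypothetical): three candidate next tokens, each with
# a 2-d embedding of the context extended by that token.
w, b = [1.0, -1.0], 0.0
candidate_embeddings = [[0.0, 2.0], [1.0, 0.5], [1.5, 0.5]]
base_probs = [0.6, 0.3, 0.1]

margins = [margin(e, w, b) for e in candidate_embeddings]
adjusted = reweight(base_probs, margins, beta=5.0)
```

In this toy run the first candidate has a negative margin, so its probability shrinks, while the third candidate, which has the largest margin, gains probability mass. Setting `beta=0.0` recovers the base distribution, which makes the parameter a simple dial between the original sampler and margin-driven steering.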

Paper Structure

This paper contains 40 sections, 2 theorems, 15 equations, 4 figures, 12 tables.

Key Result

Proposition 1

Let $\pi_{m}$ denote the scaled margin distribution derived from the learned subspace $f_v$. The weighted token sampling policy is the optimal solution for the optimization problem $\mathcal{P}$.
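
The problem $\mathcal{P}$ is not spelled out in this summary. As a hedged illustration only, one standard objective under which a margin-weighted sampling policy is optimal is a KL-regularized reward maximization. The symbols $\pi_{\mathrm{ref}}$ (the original next-token distribution), $\beta > 0$ (a trade-off parameter), $c$ (the context), and $x$ (a candidate token) are assumptions introduced here, and the paper's exact formulation may differ.

```latex
% Assumed form of the problem (not taken from the source):
\max_{\pi}\;
  \mathbb{E}_{x \sim \pi(\cdot \mid c)}\bigl[ f_v(c \oplus x) \bigr]
  \;-\; \frac{1}{\beta}\,
  \mathrm{KL}\bigl( \pi(\cdot \mid c) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid c) \bigr)

% Setting the derivative of the Lagrangian (with the normalization
% constraint \sum_x \pi(x \mid c) = 1) to zero gives
%   f_v(c \oplus x) - \tfrac{1}{\beta}\bigl(\log \pi(x \mid c)
%     - \log \pi_{\mathrm{ref}}(x \mid c) + 1\bigr) - \lambda = 0,
% hence the maximizer is the exponentially tilted policy
\pi^{*}(x \mid c) \;=\;
  \frac{\pi_{\mathrm{ref}}(x \mid c)\, \exp\bigl( \beta\, f_v(c \oplus x) \bigr)}
       {\sum_{x'} \pi_{\mathrm{ref}}(x' \mid c)\, \exp\bigl( \beta\, f_v(c \oplus x') \bigr)}
```

Under this assumed objective, a larger $\beta$ weights the margin more heavily, while $\beta \to 0$ returns the original distribution $\pi_{\mathrm{ref}}$.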

Figures (4)

  • Figure 1: Overview of SASA (self-disciplined autoregressive sampling).
  • Figure 2: The toxicity-perplexity trade-off on different datasets.
  • Figure 3: An example of the decoding process for a toxic prompt, with top token candidates selected by nucleus sampling. Given the prompt $c$, there are five candidates for the next token, {and, even, as, so, which}, with initial sampling probabilities {0.58, 0.04, 0.04, 0.03, 0.31}, which become {0, 0.99, 0, 0, 0.01} after subspace adjustment.
  • Figure 4: The toxicity accuracy as a function of the sample size.

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof