Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models

Xin Yi, Linlin Wang, Xiaoling Wang, Liang He

TL;DR

This paper addresses toxic generation in large language models by introducing fine-grained detoxification via instance-level prefixes (FGDILP), a data-free framework that uses self-generated prefixes to build fine-grained subtoxicity vectors in the attention space. It obtains one positive prefix and multiple negative prefixes through self-generation and self-diagnosis, computes layer-wise subtoxicity vectors $\Delta_j^l$, and fuses them through masking, symbolization, and alignment into a single detoxified vector $v_P^l$ per layer, which steers generation from the raw prompt toward non-toxic outputs. FGDILP is evaluated on RealToxicityPrompts and FFT against strong baselines and shows superior toxicity reduction at both the utterance and context levels, with modest trade-offs in fluency and diversity; human judgments support these results. The approach is robust across model sizes, and its lightweight, training-free design makes it practical to deploy. The authors acknowledge limitations related to self-diagnosis accuracy, reliance on toxicity evaluators, and template-based classification.

Abstract

Impressive results have been achieved in natural language processing (NLP) tasks through the training of large language models (LLMs). However, these models occasionally produce toxic content such as insults, threats, and profanity in response to certain prompts, thereby constraining their practical utility. To tackle this issue, various finetuning-based and decoding-based approaches have been utilized to mitigate toxicity. However, these methods typically necessitate additional costs such as high-quality training data or auxiliary models. In this paper, we propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost. Specifically, FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. This allows for constructing fine-grained subtoxicity vectors, which enables collaborative detoxification by fusing them to correct the normal generation process when provided with a raw prompt. We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels. Our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.
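The fusion step named in the TL;DR (masking, symbolization, alignment) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `keep_ratio` parameter, the "positive minus negative" contrast direction, and the exact rules used for each of the three stages are assumptions chosen for illustration, and the sketch works on plain vectors rather than real attention activations.

```python
import numpy as np


def subtoxicity_vectors(pos_repr, neg_reprs):
    """Contrast one positive representation with each negative one.

    pos_repr:  shape (d,)   representation under the positive prefix
    neg_reprs: shape (k, d) one representation per negative prefix
    Returns shape (k, d). The direction (positive minus negative) is an
    assumption made for this sketch.
    """
    pos_repr = np.asarray(pos_repr, dtype=float)
    neg_reprs = np.asarray(neg_reprs, dtype=float)
    return pos_repr[None, :] - neg_reprs


def fuse(deltas, keep_ratio=0.5):
    """Fuse k subtoxicity vectors into one vector (hypothetical rules).

    1. Masking: per vector, keep only the largest-magnitude entries.
    2. Symbolization: per position, elect the sign of the summed mass.
    3. Alignment: average only the entries that agree with that sign.
    """
    deltas = np.asarray(deltas, dtype=float)
    k, d = deltas.shape

    # 1. Masking: zero out everything below the per-row magnitude threshold.
    #    Ties at the threshold are kept, so slightly more than n_keep
    #    entries may survive in a row.
    n_keep = max(1, int(round(keep_ratio * d)))
    magnitudes = np.abs(deltas)
    thresholds = np.sort(magnitudes, axis=1)[:, d - n_keep][:, None]
    masked = np.where(magnitudes >= thresholds, deltas, 0.0)

    # 2. Symbolization: one elected sign per position (0 if nothing survives).
    elected = np.sign(masked.sum(axis=0))

    # 3. Alignment: keep non-zero entries whose sign matches the elected one,
    #    then take their mean position by position.
    agree = (np.sign(masked) == elected[None, :]) & (masked != 0)
    counts = agree.sum(axis=0)
    summed = (masked * agree).sum(axis=0)
    return np.where(counts > 0, summed / np.maximum(counts, 1), 0.0)


if __name__ == "__main__":
    deltas = subtoxicity_vectors([1.0, 0.0, 2.0, 0.5],
                                 [[0.0, 0.0, 3.0, 0.5],
                                  [2.0, -1.0, 0.0, 0.0]])
    print(fuse(deltas, keep_ratio=0.5))
```

Electing a sign before averaging keeps entries of opposite sign from cancelling each other, which is one plausible reading of why the fusion is described as collaborative; the paper's own equations should be consulted for the exact definitions.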

Paper Structure

This paper contains 41 sections, 5 equations, 9 figures, and 15 tables.

Figures (9)

  • Figure 1: Overview of FGDILP. In Step 1-1, multiple outputs are sampled from the model through self-generation. In Step 1-2, all outputs are categorized into toxic (negative) and nontoxic (positive) prefixes by self-diagnosis. In Step 2, a positive prefix and multiple negative prefixes (one for each subtoxicity) are each prepended to the raw prompt to form a batch. During the forward pass, their contextualized representations are compared to construct all subtoxicity vectors, which are then fused into one in attention space. In Step 3, the information flow is corrected as the raw prompt passes through the model for detoxification.
  • Figure 2: Human evaluation for RealToxicityPrompts.
  • Figure 3: Subtoxic behaviors of detoxified text. We measure the fine-grained subtoxicities using Perspective API, with Llama-2-7B as the base model.
  • Figure 4: (a) Detoxification by keeping top-K% high-magnitude values or bottom-K% low-magnitude values. (b) The ratio of conflicting element positions among subtoxicity vectors.
  • Figure 5: Human evaluation for FFT based on Llama-2-chat-7B.
  • ...and 4 more figures