Unveiling the Implicit Toxicity in Large Language Models

Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang

TL;DR

The paper investigates implicit toxicity in large language models, revealing that LLMs can generate toxic outputs that evade current detectors. It introduces an RL-based attack framework, built on a supervised warm-start, a reward model, and PPO optimization, to maximize implicit toxicity. The authors demonstrate substantial attack success across multiple toxicity detectors and show that retraining detectors on attack-generated data improves detection without sacrificing performance on existing benchmarks. They also analyze the linguistic and social factors enabling implicit toxicity and propose defenses via augmented training data and classifier refinement. The work highlights a critical safety risk and provides a concrete defense pathway, while noting limitations in reward data quality and scalability to ultra-large models.

Abstract

The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that simple zero-shot prompting is enough for LLMs to generate diverse implicit toxic outputs that are exceptionally difficult to detect. Moreover, we propose a reinforcement learning (RL) based attack method to further induce implicit toxicity in LLMs. Specifically, we optimize the language model with a reward that prefers implicit toxic outputs to explicit toxic and non-toxic ones. Experiments on five widely adopted toxicity classifiers demonstrate that the attack success rate can be significantly improved through RL fine-tuning. For instance, the RL-finetuned LLaMA-13B model achieves an attack success rate of 90.04% on BAD and 62.85% on Davinci003. Our findings suggest that LLMs pose a significant threat in generating undetectable implicit toxic outputs. We further show that fine-tuning toxicity classifiers on the annotated examples from our attack method can effectively enhance their ability to detect LLM-generated implicit toxic language. The code is publicly available at https://github.com/thu-coai/Implicit-Toxicity.
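
The abstract reports results as an attack success rate. As a rough illustration only, the sketch below assumes this metric is the share of known-toxic outputs that a classifier fails to flag; that definition, the 0.5 threshold, the function names, and the toy keyword classifier are all assumptions made for this example and are not taken from the paper or its released code.

```python
from typing import Callable, Sequence


def attack_success_rate(
    toxic_outputs: Sequence[str],
    classifier: Callable[[str], float],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> float:
    """Fraction of known-toxic outputs that the classifier scores below `threshold`."""
    if not toxic_outputs:
        raise ValueError("need at least one output to evaluate")
    missed = sum(1 for text in toxic_outputs if classifier(text) < threshold)
    return missed / len(toxic_outputs)


def keyword_classifier(text: str) -> float:
    """Hypothetical stand-in for a real toxicity classifier: flags one explicit word."""
    return 1.0 if "idiot" in text.lower() else 0.0


# Two outputs assumed to be annotated as toxic by humans: one explicit, one implicit.
samples = [
    "You are an idiot.",
    "People like you always find a way to need extra help, don't you?",
]

asr = attack_success_rate(samples, keyword_classifier)
print(f"attack success rate: {asr:.2%}")
```

In this toy setting the keyword classifier catches the explicit insult but misses the implicit one, which is the kind of gap the abstract describes.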

Paper Structure

This paper contains 43 sections, 5 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Comparison of attack success rates between previous toxic benchmark datasets (the first four bars) and the LLM-generated toxic outputs using our method (the last three bars) on four widely-adopted toxicity classifiers. We find that LLMs can generate implicit toxic outputs, which are significantly more challenging to detect than previous benchmark datasets.
  • Figure 2: Method overview. Our method consists of three steps: (1) Initialize the policy model by conducting supervised learning on the data automatically generated by prompting an instruction-tuned model. (2) Train a reward model which prefers implicit toxicity using comparison data. (3) Use reinforcement learning to optimize the policy model with this reward via PPO. Solid lines indicate that the data is used for training models, while dotted lines mean that the model generates outputs in the inference mode.
  • Figure 3: Toxic confidence of different classifiers.
  • Figure 4: Results of backbone models with different scales.
  • Figure 5: Results of RL LLaMA-13B with different KL coefficients.
  • ...and 3 more figures
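
The captions for Figure 2 and Figure 5 mention a reward model trained on comparison data and a KL coefficient used during RL. The sketch below shows one common way such pieces are written down: a pairwise ranking loss that pushes the preferred output's score above the rejected one, and a reward with a KL-style penalty against a reference model. Both formulas are standard choices assumed here for illustration; the paper's exact loss, reward shaping, and hyperparameters may differ, and all names are hypothetical.

```python
import math


def pairwise_ranking_loss(r_preferred: float, r_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)), computed in a numerically stable way.

    Assumed usage: `r_preferred` is the reward model's score for an implicit toxic
    output, `r_rejected` the score for an explicit toxic or non-toxic output.
    """
    diff = r_preferred - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_shaped_reward(
    reward: float,
    logp_policy: float,
    logp_ref: float,
    kl_coef: float,
) -> float:
    """Reward minus a per-sample KL estimate between the policy and a reference model.

    `kl_coef` plays the role of the KL coefficient; its value here is arbitrary.
    """
    return reward - kl_coef * (logp_policy - logp_ref)


# The loss is small when the preferred output already scores higher, large otherwise.
print(pairwise_ranking_loss(2.0, 0.0))
print(pairwise_ranking_loss(0.0, 2.0))

# A larger KL coefficient penalizes drifting away from the reference model more.
print(kl_shaped_reward(reward=1.0, logp_policy=-1.0, logp_ref=-2.0, kl_coef=0.1))
```

Under these assumptions, raising `kl_coef` trades reward for staying close to the initial policy, which is the kind of trade-off a sweep over KL coefficients would explore.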