Atoxia: Red-teaming Large Language Models with Target Toxic Answers

Yuhao Du; Zhuo Li; Pengyu Cheng; Xiang Wan; Anningzhe Gao

Atoxia: Red-teaming Large Language Models with Target Toxic Answers

Yuhao Du, Zhuo Li, Pengyu Cheng, Xiang Wan, Anningzhe Gao

TL;DR

This work addresses generation safety in large language models by introducing Atoxia, a target-driven red-teaming attacker trained with reinforcement learning that, given a target toxic answer $y^*$, crafts an adversarial query $x_{adv}$ and an answer opening $y_{pre}$ to provoke toxic continuations from an under-testing LLM. The attacker is trained using a reward based on the under-testing model's likelihood of producing $y^*$, eliminating the need for separate reward models, and is shown to generalize from open gray-box LLMs to closed black-box models such as GPT-4o. Empirical validation on AdvBench and HH-Harmless demonstrates high attack success rates and transferable vulnerabilities, often outperforming prior red-teaming methods like AdvPrompter across both keyword-based and GPT-4 evaluations. The results underscore the importance of proactive, target-specific red-teaming to identify and mitigate safety risks before large-scale deployment, while acknowledging computational demands and the need for continuous safety refinements in evolving LLM systems.

Abstract

Despite the substantial advancements in artificial intelligence, large language models (LLMs) remain being challenged by generation safety. With adversarial jailbreaking prompts, one can effortlessly induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the necessity for robust LLM red-teaming strategies to identify and mitigate such risks before large-scale application. To detect specific types of risks, we propose a novel red-teaming method that $\textbf{A}$ttacks LLMs with $\textbf{T}$arget $\textbf{Toxi}$c $\textbf{A}$nswers ($\textbf{Atoxia}$). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to examine the internal defects of a given LLM. The proposed attacker is trained within a reinforcement learning scheme with the LLM outputting probability of the target answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can successfully detect safety risks in not only open-source models but also state-of-the-art black-box models such as GPT-4o.

Atoxia: Red-teaming Large Language Models with Target Toxic Answers

TL;DR

This work addresses generation safety in large language models by introducing Atoxia, a target-driven red-teaming attacker trained with reinforcement learning that, given a target toxic answer

, crafts an adversarial query

and an answer opening

to provoke toxic continuations from an under-testing LLM. The attacker is trained using a reward based on the under-testing model's likelihood of producing

, eliminating the need for separate reward models, and is shown to generalize from open gray-box LLMs to closed black-box models such as GPT-4o. Empirical validation on AdvBench and HH-Harmless demonstrates high attack success rates and transferable vulnerabilities, often outperforming prior red-teaming methods like AdvPrompter across both keyword-based and GPT-4 evaluations. The results underscore the importance of proactive, target-specific red-teaming to identify and mitigate safety risks before large-scale deployment, while acknowledging computational demands and the need for continuous safety refinements in evolving LLM systems.

Abstract

ttacks LLMs with

arget

nswers (

). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to examine the internal defects of a given LLM. The proposed attacker is trained within a reinforcement learning scheme with the LLM outputting probability of the target answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can successfully detect safety risks in not only open-source models but also state-of-the-art black-box models such as GPT-4o.

Paper Structure (26 sections, 6 equations, 7 figures, 5 tables, 1 algorithm)

This paper contains 26 sections, 6 equations, 7 figures, 5 tables, 1 algorithm.

Introduction
Preliminary
LLM Red-teaming
Methodology
Reward Design
RL Optimization
Experiment
Experimental Settings
Datasets
Models
Baselines
Evaluation
Implementation Details
Results on Gray-box LLMs
Results on Black-box LLMs
...and 11 more sections

Figures (7)

Figure 1: Illustration of our proposed target-driven detection approach. The left side shows the overall pipeline, while the right details the process with examples. The gray box contains content generated by Atoxia ($\pi_\theta$) before training, and the green box shows content after reinforcement learning. During training, Atoxia interacts with the under-testing LLM ($\pi_\text{test}$), using log-likelihood as the reward signal. Finally, toxic content is identified in the red box after querying the under-testing LLM with the refined question and answer opening.
Figure 2: A case of using Atoxia, interactively fine-tuned with Llama2-7b and transferred to prompt GPT-4o for vulnerabilities detection. Adversarial prompts generated by Atoxia are highlighted in blue, while toxic content generated by GPT-4o is highlighted in red.
Figure 3: Examples of Atoxia responses with and without answer openings, along with the corresponding under-testing (UT) LLM outputs. We use the Mistral-7b model without fine-tuning as Atoxia and Llama3-8b as the under-testing LLM. Toxic content is indicated by color: red for Atoxia responses and blue for under-testing LLM outputs.
Figure 4: Failure case of GPT4 evalution.
Figure 5: Cases of generated content from Atoxia fail to attack the under-testing LLM. We use the fine-tuned Mistral-7b for Atoxia and Llama3 for the under-testing LLM as examples. Orange cases indicate instances that can be easily detected by keyword lists, while green cases indicate instances that can only be detected by GPT-4 by understanding the contents.
...and 2 more figures

Atoxia: Red-teaming Large Language Models with Target Toxic Answers

TL;DR

Abstract

Atoxia: Red-teaming Large Language Models with Target Toxic Answers

Authors

TL;DR

Abstract

Table of Contents

Figures (7)