Atoxia: Red-teaming Large Language Models with Target Toxic Answers
Yuhao Du, Zhuo Li, Pengyu Cheng, Xiang Wan, Anningzhe Gao
TL;DR
This work addresses generation safety in large language models by introducing Atoxia, a target-driven red-teaming attacker trained with reinforcement learning. Given a target toxic answer $y^*$, the attacker crafts an adversarial query $x_{adv}$ and an answer opening $y_{pre}$ intended to provoke toxic continuations from the LLM under test. The reward is the tested model's likelihood of producing $y^*$, which removes the need for a separate reward model, and the trained attacker is shown to transfer from open gray-box LLMs to closed black-box models such as GPT-4o. Experiments on AdvBench and HH-Harmless report high attack success rates and transferable vulnerabilities, often outperforming prior red-teaming methods such as AdvPrompter under both keyword-based and GPT-4 evaluations. The results underscore the importance of proactive, target-specific red-teaming to identify and mitigate safety risks before large-scale deployment, while acknowledging the computational cost and the need for continued safety refinement as LLM systems evolve.
Abstract
Despite substantial advances in artificial intelligence, large language models (LLMs) are still challenged by generation safety. With adversarial jailbreaking prompts, one can easily induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the need for robust LLM red-teaming strategies that identify and mitigate such risks before large-scale application. To detect specific types of risks, we propose a novel red-teaming method that $\textbf{A}$ttacks LLMs with $\textbf{T}$arget $\textbf{Toxi}$c $\textbf{A}$nswers ($\textbf{Atoxia}$). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to examine the internal defects of a given LLM. The attacker is trained with reinforcement learning, using the probability that the LLM outputs the target answer as the reward. We verify the effectiveness of our method on several red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can detect safety risks not only in open-source models but also in state-of-the-art black-box models such as GPT-4o.
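The reward described in the abstract can be illustrated with a minimal sketch. It assumes the reward is the log-likelihood of the target answer, summed token by token, conditioned on the generated query and answer opening; the abstract only states that the output probability of the target answer is used, so the log form, the lack of length normalization, and every name below (`target_log_likelihood`, `uniform_model`, the `token_logprob` callback) are hypothetical and not taken from the paper. The stand-in model is a uniform distribution over a toy vocabulary, used only so the example runs without any LLM.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: returns log p(token | context) under the model being tested.
TokenLogProb = Callable[[Sequence[str], str], float]


def target_log_likelihood(
    token_logprob: TokenLogProb,
    query_tokens: Sequence[str],
    opening_tokens: Sequence[str],
    target_tokens: Sequence[str],
) -> float:
    """Sum of log p(target token | query, opening, earlier target tokens).

    Assumption: the reward is this unnormalized log-likelihood; the paper
    may use a different form (for example, a length-normalized one).
    """
    context: List[str] = list(query_tokens) + list(opening_tokens)
    total = 0.0
    for token in target_tokens:
        total += token_logprob(context, token)
        context.append(token)  # teacher forcing: condition on the target so far
    return total


def uniform_model(vocab_size: int) -> TokenLogProb:
    """Toy stand-in for an LLM: every token has probability 1 / vocab_size."""
    def logprob(context: Sequence[str], token: str) -> float:
        return -math.log(vocab_size)
    return logprob


# Placeholder tokens only; a real setup would use the tested model's tokenizer.
reward = target_log_likelihood(
    uniform_model(4),
    query_tokens=["q1", "q2"],
    opening_tokens=["o1"],
    target_tokens=["t1", "t2", "t3"],
)
```

In a reinforcement learning loop, a scalar of this kind would be computed for each attacker output and used as the training signal, so no separate reward model is required.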
