
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao

TL;DR

The paper addresses safety gaps in aligned LLMs caused by natural distribution shifts between benign prompts and toxic prompts. It introduces ActorBreaker, a two-stage method grounded in Latour's actor-network theory to automatically construct attack paths from semantic relations to harmful targets and generate multi-turn prompts via self-talk. Empirical results show ActorBreaker achieves superior attack success rates and diversity across multiple models on HarmBench, and a multi-turn safety dataset built from these prompts improves robustness through safety fine-tuning, though some utility is sacrificed. The work underscores the need to broaden safety training to cover a wider semantic space and provides a framework and dataset for safer alignment.

Abstract

Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.

Paper Structure

This paper contains 21 sections, 1 equation, 13 figures, 10 tables.

Figures (13)

  • Figure 1: (a): A real-world example of our multi-turn attack compared with the single-turn toxic query. (b): the schematic description of our method. Each triangle box represents an actor, semantically related to the harmful target, as a hint for our multi-turn attack. The series of white circles represent a sequence of thoughts about how to finish our multi-turn attack step by step.
  • Figure 2: During the pre-attack stage, ActorBreaker first leverages the knowledge of LLMs to instantiate our conceptual network ${\mathcal{G}}_{concept}$ as ${\mathcal{G}}_{inst}$, a two-layer tree. The leaf nodes of ${\mathcal{G}}_{inst}$ are specific actor names. ActorBreaker then samples actors and their relationships with the harmful target as our attack clues.
  • Figure 3: Our in-attack process consists of three steps: (a) infer the attack chain, i.e., how to perform our attack step by step, based on the attack clue; (b) follow the attack chain to generate the initial attack path via self-talk, i.e., self-ask and self-answer; (c) dynamically modify the initial attack path by exploiting responses from the victim model, using a GPT4-Judge, to enhance effectiveness.
  • Figure 4: The proportion of judge scores for attacks generated by ActorBreaker, for various numbers of actors, against (a) GPT-4o and (b) Claude-3.5-sonnet. A higher score means a more harmful model response, and a score of 5 indicates a successful attack; (c): attack success rate of ActorBreaker for varying numbers of actors against GPT-4o and Claude-3.5-sonnet.
  • Figure 5: The classifier score produced by LlamaGuard 2 for both plain harmful queries and multi-turn attack queries against GPT-4o (a) and Claude-3.5-sonnet (b). The classifier score represents the probability that the prompt is "unsafe".
  • ...and 8 more figures
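
The Figure 4 caption relates two quantities: the proportion of responses at each judge score, and the attack success rate, where a score of 5 counts as a success. The sketch below shows one way these could be computed from a list of judge scores. It is illustrative only: the 1 to 5 score range, the function names, and the example scores are assumptions made here, not values or code from the paper.

```python
from collections import Counter

# Per the Figure 4 caption, a judge score of 5 marks a successful attack.
SUCCESS_SCORE = 5
# Assumption: judge scores are integers from 1 to 5 (the lower bound is not
# stated in the captions above).
SCORE_RANGE = range(1, 6)


def score_proportions(scores):
    """Return the fraction of responses at each judge score."""
    if not scores:
        raise ValueError("scores must be non-empty")
    counts = Counter(scores)
    total = len(scores)
    return {s: counts.get(s, 0) / total for s in SCORE_RANGE}


def attack_success_rate(scores):
    """Return the fraction of responses that received the success score."""
    return score_proportions(scores)[SUCCESS_SCORE]


# Hypothetical judge scores, invented for illustration only.
example_scores = [1, 5, 3, 5, 2, 5, 1, 4]
print(score_proportions(example_scores))
print(attack_success_rate(example_scores))
```

Under this reading, the attack success rate in panel (c) is simply the score-5 share of the distributions plotted in panels (a) and (b).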