from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors

Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jingyu Lei, Qi Li

TL;DR

A novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Experiments demonstrate that AVATAR jailbreaks LLMs effectively and transferably, achieving a state-of-the-art attack success rate across multiple advanced LLMs.

Abstract

Current studies have exposed the risk that Large Language Models (LLMs) generate harmful content under jailbreak attacks. However, they overlook that directly generating harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason about and calibrate the metaphorical content, and is thus jailbroken either by directly outputting harmful responses or by calibrating the residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR jailbreaks LLMs effectively and transferably, achieving a state-of-the-art attack success rate across multiple advanced LLMs.

Paper Structure

This paper contains 59 sections, 9 equations, 23 figures, 10 tables, 1 algorithm.

Figures (23)

  • Figure 1: Illustration of inducing the target LLM (GPT-4o) to perform harmful metaphor analysis (Cook the dish $\rightarrow$ Build a bomb) and using the target LLM to calibrate the metaphorical content for jailbreaking.
  • Figure 2: Overview of AVATAR, a training-free black-box attack method with two main steps. First, Adversarial Entity Mapping uses crowdsourced models to identify appropriate metaphors, balancing the effectiveness of harmful content against toxicity concealment. Then, Metaphor-Induced Reasoning nests the metaphors into interactions and induces the target model to generate harmful output from the metaphorical analysis.
  • Figure 3: Illustration of Adversarial Entity Mapping, which creates adversarial metaphors via crowdsourcing.
  • Figure 4: Illustration of Metaphor-Induced Reasoning, which loads adversarial metaphors into a series of queries and adaptively adjusts queries according to LLMs' feedback.
  • Figure 5: Transfer attack performance (ASR-GPT, %) of AVATAR on HarmBench. The attack uses adversarial prompts whose effectiveness has been verified on affordable LLMs (Qwen2-7B, Llama3-8B, and GPT-4o-mini). "Fixed template" means that the adversarial metaphor is loaded only onto the base queries ($Q_{\text{ctx}}$, $Q_{\text{det}}$) for induction. "Adaptive opt." means that adaptive queries ($\mathcal{Q}_{\text{ext}}^*$) and Adversarial Interaction Optimization are also introduced.
  • ...and 18 more figures