You Know What I'm Saying: Jailbreak Attack via Implicit Reference

Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo

TL;DR

Attack via Implicit Reference (AIR) decomposes a malicious objective into permissible objectives and links them through implicit references within the context. By combining multiple related harmless objectives, it generates malicious content without triggering refusal responses, effectively bypassing existing detection techniques.

Abstract

While recent advancements in large language model (LLM) alignment have enabled the effective identification of malicious objectives involving scene nesting and keyword rewriting, our study reveals that these methods remain inadequate at detecting malicious objectives expressed through context within nested harmless objectives. This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. This method employs multiple related harmless objectives to generate malicious content without triggering refusal responses, thereby effectively bypassing existing detection techniques. Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B. Notably, we observe an inverse scaling phenomenon, where larger models are more vulnerable to this attack method. These findings underscore the urgent need for defense mechanisms capable of understanding and preventing contextual attacks. Furthermore, we introduce a cross-model attack strategy that leverages less secure models to generate malicious contexts, thereby further increasing the ASR when targeting other models. Our code and jailbreak artifacts can be found at https://github.com/Lucas-TY/llm_Implicit_reference.

Paper Structure (49 sections, 2 equations, 5 figures, 18 tables, 1 algorithm)

Figures (5)

  • Figure 1: Motivation example: Different query methods for jailbreaking LLMs. (a) Stating the malicious objective directly; (b) Rewriting the malicious objective in the past tense, exploiting mismatched generalization in alignment; (c) Nesting the malicious objective within a harmless objective using Scenario Nesting; (d) Decomposing the malicious objective into nested harmless objectives. The results show that LLMs do not reject the decomposed malicious objective.
  • Figure 2: Overview of the AIR framework: (1) Rewriting: Use a language model to rewrite the original malicious objective into nested objectives. (2) First Attack: Input the prompt into the target model and add the model's response to the conversation history. (3) Second Attack: Send a follow-up objective that asks the model to add more detail to its response and remove undesired judgments.
  • Figure 3: Attack Success Rate Heatmap. ASR of implicit reference attack across various models and 10 harmful categories from JBB-Behaviors, as assessed by the Malicious Score Evaluator. Darker colors indicate higher success rates.
  • Figure : Implicit Reference Cross Model Attack
  • Figure :