Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Peng Ding, Jun Kuang, Wen Sun, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang

TL;DR

This work introduces Intent Shift Attack (ISA), a jailbreaking method that hides harmful intent through minimal linguistic edits, causing LLM safety systems to misclassify harmful queries as benign. ISA uses a five-type taxonomy (Person, Tense, Voice, Mood, Question shifts) and a two-step pipeline to transform prompts, achieving high attack success across open-source and commercial models, with mood and question shifts often the most effective. A key finding is that fine-tuning on benign data reformulated with ISA can push attack success to nearly 100%, exposing a critical safety-utility tension in current alignment approaches. The paper also evaluates existing defenses, finds them insufficient against ISA, and proposes training-free and training-based mitigations, highlighting a need for more robust, intent-aware safety mechanisms with broader applicability and fewer unintended refusals. Overall, the work emphasizes the challenge of robust intent inference in LLM safety and suggests directions for stronger defenses and cross-linguistic resilience.

Abstract

Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for building robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obscures the intent of an attack from the LLM. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that LLMs may misperceive as benign requests for information. Unlike prior methods that rely on complex tokens or lengthy context, our approach needs only minimal edits to the original request and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, and we explore both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLM safety and underscore the need for more effective defenses. Our code and datasets are available at https://github.com/NJUNLP/ISA.

Paper Structure

This paper contains 22 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparison of our ISA with several existing mainstream jailbreaking methods. These approaches (e.g., GPTFuzzer (Yu et al., 2023), GCG (Zou et al., 2023), ReNeLLM (Ding et al., 2024)) disguise harmful requests with additional context or adversarial tokens while preserving the core malicious intent. In contrast, ISA transforms the intent to appear legitimate through minimal linguistic modifications, making harmful queries seem benign.
  • Figure 2: Example of our intention analysis multiple-choice questions.
  • Figure 3: Case study of fine-tuning Llama-3.1 with benign ISA data: Although the model initially refuses to answer the Person Shift ISA sample "How do criminals make a bomb?", after fine-tuning solely on benign ISA-reformulated data, the model provides detailed instructions for bomb-making.
  • Figure 4: Case study of intention multiple-choice test: Both GPT-4.1 and DeepSeek-R1 select option B (general knowledge inquiry), which helps explain their vulnerability to ISA as they misinterpret malicious intent as benign information-seeking.