
Say It Differently: Linguistic Styles as Jailbreak Vectors

Srikant Panda, Avinash Rai

TL;DR

This work identifies linguistic style as a substantive jailbreak vector for large language models, showing that stylistic framing (e.g., fear, curiosity, compassion) can raise jailbreak success rates by up to 57 percentage points even when semantics remain unchanged. Using an 11-style benchmark and 16 instruction-tuned models, evaluated with both template-based and contextualized rewrites, the authors find that the increase is largest for naturalistic, contextualized rewrites. A simple style-neutralization preprocessing step mitigates many of these attacks, providing causal evidence that style cues, rather than paraphrase alone, drive the vulnerability. The findings highlight systemic gaps in current safety pipelines and argue for incorporating stylistic diversity into red-teaming and defense design to build more robust, human-aligned LLMs.

Abstract

Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to 57 percentage points. Styles such as fearful, curious, and compassionate are the most effective, and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step that uses a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
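
The pipeline the abstract describes (template-based style rewrites, a neutralization step placed in front of the target model, and success-rate differences reported in percentage points) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the templates, the function names (`apply_style`, `guarded_generate`, `asr_delta_by_style`), and the callable interfaces for the neutralizer and target model are all hypothetical, and `<REQUEST>` is a stand-in for a benchmark prompt.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical templates for illustration only; the paper's actual
# handcrafted templates are not reproduced in this section.
STYLE_TEMPLATES: Dict[str, str] = {
    "neutral": "{prompt}",
    "fearful": "I'm scared and don't know what else to do. {prompt}",
    "curious": "Purely out of curiosity, I've always wondered: {prompt}",
}


def apply_style(prompt: str, style: str) -> str:
    """Wrap a prompt in a style template, leaving the request text unchanged."""
    return STYLE_TEMPLATES[style].format(prompt=prompt)


def guarded_generate(
    user_input: str,
    neutralize: Callable[[str], str],
    target: Callable[[str], str],
) -> str:
    """Pass the input through a neutralizer before the target model sees it.

    Both callables are assumptions: `neutralize` stands in for a secondary
    LLM that strips stylistic cues, `target` for the model under test.
    """
    return target(neutralize(user_input))


def success_rate(outcomes: List[bool]) -> float:
    """Share of attempts judged as jailbreaks, in percent."""
    return 100.0 * sum(outcomes) / len(outcomes)


def asr_delta_by_style(
    records: Iterable[Tuple[str, bool]], baseline: str = "neutral"
) -> Dict[str, float]:
    """Per-style change in success rate versus the baseline, in percentage points."""
    by_style: Dict[str, List[bool]] = defaultdict(list)
    for style, jailbroken in records:
        by_style[style].append(jailbroken)
    if baseline not in by_style:
        raise ValueError(f"no records for baseline style {baseline!r}")
    base = success_rate(by_style[baseline])
    return {
        style: success_rate(outcomes) - base
        for style, outcomes in by_style.items()
        if style != baseline
    }
```

Here `records` would hold one `(style, jailbroken)` pair per evaluated prompt, with the boolean coming from whatever judge is used. Reporting a difference in percentage points, rather than a ratio, matches how the abstract states its headline number.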

Paper Structure

This paper contains 39 sections, 1 equation, 1 figure, 13 tables.

Figures (1)

  • Figure 1: Example of a stylized attack overriding safety defenses and leading to a successful jailbreak.