When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi

TL;DR

The paper shows that style patterns in prompts can inflate jailbreak attack success rate (ASR) across numerous LLMs, a phenomenon linked to how much attention models pay to style tokens and to their exposure to style-rich alignment data. It shows that superficial style alignment during fine-tuning can make models more vulnerable to similarly styled attacks, and proposes SafeStyle, a lightweight defense that augments safety data to match the style distribution of the fine-tuning set. SafeStyle consistently reduces ASR while preserving style adaptation utility across multiple models, styles, and real-world tuning datasets. The work highlights the need to audit alignment data for hidden style-pattern biases and provides a practical remedy for improving LLM safety in the presence of style-based prompts.

Abstract

Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating 32 LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs, six fine-tuning style settings, and two real-world instruction-tuning datasets, SafeStyle consistently outperforms baselines in maintaining LLM safety.
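
The definition of ASR inflation above can be illustrated with a minimal sketch. It assumes that each benchmark query has a style-removed counterpart and that a judge has already labeled every response as a successful attack or not; the function names and the judgment values below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Fraction of queries whose responses were judged as successful attacks."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def asr_inflation(styled: Sequence[bool], style_removed: Sequence[bool]) -> float:
    """ASR on the original (styled) queries minus ASR on the style-removed ones.

    Assumption: the two sequences are paired, one entry per benchmark query.
    """
    if len(styled) != len(style_removed):
        raise ValueError("paired query sets must have equal length")
    return attack_success_rate(styled) - attack_success_rate(style_removed)


# Hypothetical judge outputs for five paired queries (True = attack succeeded).
styled = [True, True, False, True, False]
style_removed = [True, False, False, False, False]
inflation = asr_inflation(styled, style_removed)
print(f"ASR inflation: {inflation:.2f}")
```

A positive value means the style patterns alone raised the attack success rate on this toy data; how the paper actually removes style patterns and judges responses is not specified in this summary.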

Paper Structure

This paper contains 24 sections, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: An overview of ASR inflation caused by superficial style alignment. Style patterns often appear in both benign instructions and jailbreak queries. The superficial alignment hypothesis argues that LLMs merely adapt to the styles present in their alignment data. Consequently, even though these style patterns are semantically unrelated to the underlying malicious intent, LLMs exhibit inflated ASR on jailbreak queries that share similar styles.
  • Figure 2: (a) Nearly all of the 32 examined LLMs exhibit inflated ASR due to the incorporation of style patterns in jailbreak queries. (b) All seven jailbreak benchmarks lead to ASR inflation, with SorryBench and MedSafetyBench affecting the most LLMs.
  • Figure 3: (a) A statistically significant rank correlation indicates that LLMs paying more attention to style patterns are more likely to exhibit ASR inflation. (b) Style patterns in jailbreak queries that lead to ASR inflation have higher bigram overlap frequencies in the instruction-tuning datasets of the OLMo family.
  • Figure 4: Safety evaluation results for Llama-3.1-8B-Instruct fine-tuned on five training styles and evaluated across six testing styles. The fine-tuned model shows a sharp increase in ASR when the training and testing styles match. This increase is mitigated by mixing more style-removed data into the fine-tuning set. The position of style patterns (prefix vs. suffix) has little effect on ASR trends.
  • Figure 5: (a, b) Safety examples that match the style patterns in the fine-tuning data (list style in (a), poem style in (b)) most effectively preserve LLM safety under superficial style alignment. (c) Using only 50 safety examples with matched style patterns strikes a balance between maintaining LLM safety and improving style adaptation utility.
  • ...and 4 more figures
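
The style matching behind SafeStyle (Figure 5) can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the style labels, the templates in `STYLE_TEMPLATES`, and the function names are all hypothetical, and the sketch simply samples a style for each safety prompt in proportion to how often that style occurs in the fine-tuning data.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical style templates; the paper's actual augmentation is not given here.
STYLE_TEMPLATES: Dict[str, Callable[[str], str]] = {
    "list": lambda q: f"{q} Format the response as a list.",
    "poem": lambda q: f"{q} Write the response as a poem.",
    "none": lambda q: q,
}


def style_distribution(style_labels: Sequence[str]) -> Dict[str, float]:
    """Empirical frequency of each style label in the fine-tuning data."""
    if not style_labels:
        raise ValueError("need at least one style label")
    counts = Counter(style_labels)
    total = sum(counts.values())
    return {style: count / total for style, count in counts.items()}


def augment_safety_prompts(
    safety_prompts: Sequence[str],
    finetune_style_labels: Sequence[str],
    seed: int = 0,
) -> List[str]:
    """Restyle each safety prompt with a style drawn from the fine-tuning distribution."""
    rng = random.Random(seed)
    dist = style_distribution(finetune_style_labels)
    styles = list(dist)
    weights = [dist[s] for s in styles]
    sampled = rng.choices(styles, weights=weights, k=len(safety_prompts))
    return [STYLE_TEMPLATES[s](p) for p, s in zip(safety_prompts, sampled)]


# Placeholder prompts; in practice each would be paired with a refusal response.
safety_prompts = [f"<unsafe request {i}>" for i in range(4)]
finetune_styles = ["list"] * 3 + ["poem"]  # hypothetical labels for the fine-tuning set
augmented = augment_safety_prompts(safety_prompts, finetune_styles)
for prompt in augmented:
    print(prompt)
```

The augmented prompts would then be mixed into the fine-tuning set alongside their refusals. Sampling per example keeps the sketch short; assigning styles by exact quota would match the target distribution more tightly on small safety sets.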