Single Character Perturbations Break LLM Alignment

Leon Lin, Hannah Brown, Kenji Kawaguchi, Michael Shieh

TL;DR

This work shows that a single trailing space appended to an LLM's conversation template can bypass alignment safeguards across multiple open-source models, revealing a fragile interaction between chat templates, tokenization, and pre-training contexts. By combining AdvBench-derived data, a gray-box perturbation setup, and a targeted token search (including the GCG method), the authors quantify attack success rates and analyze why space is disproportionately effective, linking it to shifts in next-token distributions and training-time token contexts. They demonstrate that removing or mitigating this vulnerability requires robustness-in-depth, including template-aware defenses and targeted fine-tuning (e.g., prepending spaces during LoRA-finetuning) to reduce the propensity for unsafe outputs. The findings highlight the practical importance of documenting conversation templates, improving alignment robustness, and developing token-level defenses to bolster safety in real-world deployments.

Abstract

When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not produce unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study of eight open-source models, we demonstrate that this attack is strong enough to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model alignment and highlight the importance of developing more robust alignment methods. Code and data will be available at https://github.com/hannah-aught/space_attack.
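
The perturbation described in the abstract amounts to a one-character change to the templated prompt. The following is a minimal sketch of that step only, assuming a simplified Vicuna-style single-turn template; the system text, the `apply_template` and `perturb` helpers, and the benign example query are all hypothetical and are not taken from the authors' code.

```python
# Assumed, simplified Vicuna-style template; the exact wording is hypothetical.
SYSTEM = "A chat between a curious user and an artificial intelligence assistant."


def apply_template(user_message: str, system: str = SYSTEM) -> str:
    """Wrap a user message in a simple single-turn chat template."""
    return f"{system} USER: {user_message} ASSISTANT:"


def perturb(prompt: str, suffix: str = " ") -> str:
    """Append a single character (a space by default) to the templated prompt."""
    return prompt + suffix


clean = apply_template("Summarize the plot of Hamlet.")
perturbed = perturb(clean)

# repr() makes the trailing space visible when printed.
print(repr(clean[-12:]))
print(repr(perturbed[-12:]))
```

The two strings differ by exactly one trailing character, but a tokenizer may encode that character as a separate token, which is the kind of template and tokenization interaction the paper studies.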

Paper Structure (40 sections, 1 theorem, 4 equations, 8 figures, 14 tables)

This paper contains 40 sections, 1 theorem, 4 equations, 8 figures, 14 tables.

Key Result

Proposition E.1

Suppose that $|Q| < d$ and that $H$ is positive definite. Then, for any $\epsilon > 0$, there exists $s \in \mathbb{R}^d$ such that $\tilde{A}_j(q) < \epsilon$ for all $q \in Q$, $(x_r)_{r\neq j} \in \mathcal{X}^{T-1}$, and $j \in [T]$.

Figures (8)

  • Figure 1: When a user queries a chat model, this input is put into a chat template, and this template is given to a model for inference. By appending a space to the end of this template, we can circumvent model alignment.
  • Figure 2: Example of the application of a chat template for Vicuna.
  • Figure 3: ASR for 7B models with different punctuation appended to the end of the template. We report the ASR for the top three tokens here. Full results on all punctuation tokens can be found in the appendix.
  • Figure 4: Mean overlaps in top-k predicted next tokens before and after appending space to model templates for $k\in\{5,10,30,100\}$.
  • Figure 5: Percent of tokens of each type following a single space token for each model tokenizer. Guanaco and Vicuna are excluded as they use the Llama-2 tokenizer.
  • ...and 3 more figures
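
The top-k overlap reported in Figure 4 can be illustrated with a short sketch. This is one plausible reading of the metric, not the authors' implementation: it assumes the overlap is the size of the intersection of the two top-k token sets divided by k, and the helper names and toy score vectors are hypothetical.

```python
def top_k_ids(scores, k):
    """Return the set of indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def top_k_overlap(scores_before, scores_after, k):
    """Fraction of top-k tokens shared by two next-token score vectors.

    Normalizing by k is an assumption; the paper may report raw counts.
    """
    before = top_k_ids(scores_before, k)
    after = top_k_ids(scores_after, k)
    return len(before & after) / k


def mean_overlap(pairs, k):
    """Average the top-k overlap over (before, after) score-vector pairs."""
    return sum(top_k_overlap(b, a, k) for b, a in pairs) / len(pairs)


# Toy next-token scores over a five-token vocabulary (made-up numbers).
before = [0.1, 0.5, 0.2, 0.9, 0.3]
after = [0.8, 0.1, 0.7, 0.9, 0.2]

print(top_k_overlap(before, after, 2))
print(mean_overlap([(before, after), (before, before)], 2))
```

In practice the score vectors would be a model's next-token logits for the same prompt with and without the appended space; a low overlap indicates that the single character substantially shifts the predicted distribution.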

Theorems & Definitions (2)

  • Proposition E.1
  • proof