Table of Contents
Fetching ...

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

TL;DR

This work identifies ChatBug, a common safety-alignment vulnerability induced by rigid chat templates used in instruction tuning. It presents two attacks—format mismatch and message overflow—that exploit template structure to provoke unsafe outputs across eight SOTA LLMs and to amplify jailbreak attacks. Experimental results quantify the severity and transferability of ChatBug using AdvBench, showing substantial ASR increases and a clear safety-utility trade-off when applying defenses such as adversarial training. The paper highlights the need for new instruction-tuning approaches that better balance safety and helpfulness, and discusses detection-based strategies as potential complements to mitigation. Overall, ChatBug reveals a practical risk in current alignment pipelines and calls for collaborative efforts to develop safer, more robust instruction-following models.

Abstract

Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We develop two attacks to exploit the ChatBug vulnerability. We demonstrate that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

TL;DR

This work identifies ChatBug, a common safety-alignment vulnerability induced by rigid chat templates used in instruction tuning. It presents two attacks—format mismatch and message overflow—that exploit template structure to provoke unsafe outputs across eight SOTA LLMs and to amplify jailbreak attacks. Experimental results quantify the severity and transferability of ChatBug using AdvBench, showing substantial ASR increases and a clear safety-utility trade-off when applying defenses such as adversarial training. The paper highlights the need for new instruction-tuning approaches that better balance safety and helpfulness, and discusses detection-based strategies as potential complements to mitigation. Overall, ChatBug reveals a practical risk in current alignment pipelines and calls for collaborative efforts to develop safer, more robust instruction-following models.

Abstract

Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We develop two attacks to exploit the ChatBug vulnerability. We demonstrate that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research
Paper Structure (50 sections, 4 equations, 7 figures, 8 tables)

This paper contains 50 sections, 4 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: This figure illustrates how the format mismatch attack and message overflow attack exploit ChatBug. The format mismatch attack alters the default chat format () to bypass safety alignment of LLMs. The message overflow attack inserts a short sequence of tokens () into the field reserved for the aligned LLM to bypass safety alignment.
  • Figure 2: This figure shows how the ratio $\frac{P_\mathcal{M}(\cdot|\hat{x}_{1:n})}{P_\mathcal{M}(\cdot|x_{1:n})}$ evolves at each decoding step $n$ (i.e., the number of response tokens), with the results averaged over 50 instructions. The format mismatch attack significantly increases the probability of generating the desired harmful response.
  • Figure 3: This figure presents the probability of generating the desired harmful response when the number of overflow tokens varies from 0 to 9, averaged over 50 instructions. Note that the user does not launch message overflow attack when the number of overflow tokens is zero. The results show that the probability of generating the desired harmful response increases as the user overflows more tokens.
  • Figure 4: This figure shows how ASR evolves as the number of shots used by Overflow-FS increases. The results show that as Overflow-FS uses more shots, the ASR monotonically increases, regardless of the evaluation method. This indicates the effectiveness of Overflow-FS and thus the severity of ChatBug.
  • Figure 5: Vicuna Template
  • ...and 2 more figures