Bypassing Safety Guardrails in LLMs Using Humor
Pedro Cisneros-Velarde
TL;DR
The paper investigates whether humor can be used to bypass safety guardrails in LLMs. It proposes a simple fixed-template attack that preserves the unsafe request verbatim while adding a humorous framing, and it evaluates single-turn and multi-turn variants across three datasets and four open-source models. Humor-based prompts yield unsafe responses in 42 of 48 cases. Removing the humor reduces effectiveness, and so can adding more of it, which suggests the attack depends on a balance between humor and focus on the unsafe request. The work highlights gaps in how safety training generalizes to humorous contexts and argues for more robust, humor-aware defenses.
Abstract
In this paper, we show that it is possible to bypass the safety guardrails of large language models (LLMs) through a humorous prompt that includes the unsafe request. In particular, our method does not edit the unsafe request and follows a fixed template -- it is simple to implement and does not need additional LLMs to craft prompts. Extensive experiments show the effectiveness of our method across different LLMs. We also show that both removing humor from our method and adding more humor to it can reduce its effectiveness -- excessive humor possibly distracts the LLM from fulfilling the unsafe request. Thus, we argue that LLM jailbreaking occurs when there is a proper balance between focus on the unsafe request and presence of humor.
