CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
TL;DR
This work investigates jailbreaking of large language models and proposes a hypothesis about their safety mechanism: aligned models first perform intent security recognition and then generate a response. It introduces CodeChameleon, a framework that uses personalized encryption to conceal intent and an embedded decryption function to preserve task execution, recasting prompts as code-completion tasks to exploit the models' code capabilities. Across seven LLMs, CodeChameleon achieves an average Attack Success Rate (ASR) of 77.5%, with GPT-4-1106 reaching 86.6%, indicating that larger or more code-capable models can remain vulnerable unless their safety alignment is strengthened accordingly. The findings underscore the need for robust defenses against code-style, encrypted jailbreak prompts and motivate continued work on keeping safety alignment in step with advancing model capabilities.
Abstract
Adversarial misuse, particularly through 'jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
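
The abstract reports ASR per model and as an average across models without spelling out the arithmetic. The sketch below shows one plausible reading, assuming ASR is the percentage of attempts judged successful and that the average is an unweighted mean over models. Both are assumptions, not the paper's stated protocol. The function names and the toy judgments are hypothetical and carry no relation to the reported results.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attempts judged successful (assumed ASR definition)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)


def average_asr(per_model_outcomes):
    """Return per-model ASR and the unweighted mean across models (assumed averaging)."""
    rates = {name: attack_success_rate(o) for name, o in per_model_outcomes.items()}
    return rates, sum(rates.values()) / len(rates)


# Toy, made-up judgments for two placeholder models (not data from the paper).
toy_outcomes = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, False],
}
rates, mean_rate = average_asr(toy_outcomes)
print(rates, mean_rate)
```

An unweighted mean treats every model equally regardless of how many queries it was evaluated on; if the per-model query counts differ, a pooled rate over all attempts would give a different number.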
