CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work investigates jailbreaking of large language models by proposing a safety mechanism hypothesis: that models first perform intent security recognition and then generate responses. It introduces CodeChameleon, a framework using personalized encryption to conceal intent and embedded decryption to preserve task execution, recasting prompts as code-completion tasks to exploit code capabilities. Across seven LLMs, CodeChameleon achieves a high average Attack Success Rate of 77.5%, with GPT-4-1106 reaching 86.6%, revealing that larger or more code-capable models can remain vulnerable without stronger safety alignment. The findings underscore the need for robust defenses against code-style, encrypted jailbreak prompts and motivate continued work on aligning safety protocols with advanced model capabilities.

Abstract

Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.

Paper Structure

This paper contains 39 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: We propose a safety mechanism hypothesis: intent security recognition and response generation. Jailbreak prompts based on personalized encryption can successfully conceal malicious intent and lead to unsafe output.
  • Figure 2: Overview of CodeChameleon. Initially, we utilize a personalized encryption function to transform the unsafe query into an encrypted format. Subsequently, the decryption function and encrypted query are embedded into a code-style instruction template to generate the jailbreak prompt.
  • Figure 3: Comparing ASR performance for text-style and code-style instructions. We adopt three experimental setups: Without Encryption and Decryption (w/o en_de), With Encryption Only (w/ en), and With Encryption and Decryption (w/ en_de).
  • Figure 4: Our design of four encryption functions
  • Figure 5: Our design of four decryption functions
  • ...and 5 more figures