When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers

Divij Handa, Zehua Zhang, Amir Saeidi, Shrinidhi Kumbhar, Md Nayem Uddin, Aswin RRV, Chitta Baral

TL;DR

LLM safety defenses can fail against novel, user-created ciphers as models’ reasoning improves, creating a paradox in which more capable models are easier to jailbreak. The authors propose Attacks using Custom Encryptions (ACE) and Layered Attacks using Custom Encryptions (LACE), and introduce CipherBench to quantify how well models decrypt encoded text. Empirical results across multiple open-source and proprietary models show that larger, thinking LLMs achieve higher decryption success on complex, layered ciphers, revealing a safety-reasoning tradeoff as well as instances of over-defensiveness. The work highlights the need for defenses that anticipate unknown, multi-layer cipher attacks and suggests directions for strengthening safety training beyond known ciphers.

Abstract

Recent advancements in Large Language Model (LLM) safety have primarily focused on mitigating attacks crafted in natural language or common ciphers (e.g. Base64), which are likely integrated into newer models' safety training. However, we reveal a paradoxical vulnerability: as LLMs advance in reasoning, they inadvertently become more susceptible to novel jailbreaking attacks. Enhanced reasoning enables LLMs to interpret complex instructions and decode complex user-defined ciphers, creating an exploitable security gap. To study this vulnerability, we introduce Attacks using Custom Encryptions (ACE), a jailbreaking technique that encodes malicious queries with novel ciphers. Extending ACE, we introduce Layered Attacks using Custom Encryptions (LACE), which applies multi-layer ciphers to amplify attack complexity. Furthermore, we develop CipherBench, a benchmark designed to evaluate LLMs' accuracy in decoding encrypted benign text. Our experiments reveal a critical trade-off: LLMs that are more capable of decoding ciphers are more vulnerable to LACE, with success rates on gpt-oss-20b escalating from 60% under ACE to 72% with LACE. These findings highlight a critical insight: as LLMs become more adept at deciphering complex user ciphers--many of which cannot be preemptively included in safety training--they become increasingly exploitable.

Paper Structure (59 sections, 2 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Our first jailbreaking method, Attack using Custom Encryptions (ACE), uses novel user-designed ciphers to jailbreak LLMs, achieving a higher attack success rate (ASR) than both common and uncommon ciphers. Building upon ACE, Layered Attacks using Custom Encryptions (LACE) combines two encryption techniques, such as a Word Substitution Cipher followed by a Word Reversal Cipher. All ASRs displayed correspond to gpt-oss-20b.
  • Figure 2: Decryption Success Rate (DSR) of LLMs on CipherBench. Poor performance is observed when sentences are modified to contain random characters instead of English sentences. This indicates the models' reliance on recovering sentences from statistical patterns instead of understanding the decryption algorithm.
  • Figure 3: Decryption Success Rate (DSR) of LLMs across the 10 ciphers present in CipherBench.
  • Figure 4: Heatmap showing the ASR of ACE across all 14 categories using all uncommon or novel ciphers for GPT-4o.
  • Figure 5: Plot showing the number of over-defended instances across all five encryptions used in ACE.
  • ...and 1 more figures
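
The layering mentioned in the Figure 1 caption and the DSR metric in Figures 2 and 3 can be illustrated with a minimal sketch on benign text, in the spirit of CipherBench. Everything in it is an assumption made for illustration rather than the authors' implementation: the word mapping, the reading of "word reversal" as reversing the characters of each word, the exact-match scoring used for DSR, and all function names are hypothetical.

```python
# Hypothetical illustration only: the mapping, cipher definitions, and
# exact-match scoring are assumptions, not taken from the paper.

# Assumed word substitution table (benign words only).
SUBSTITUTION = {"cat": "river", "sat": "painted", "mat": "cloud"}
INVERSE = {cipher: plain for plain, cipher in SUBSTITUTION.items()}


def substitute_words(text, mapping):
    """Replace each whitespace-separated word found in `mapping`."""
    return " ".join(mapping.get(word, word) for word in text.split())


def reverse_words(text):
    """Reverse the characters of each word, keeping word order."""
    return " ".join(word[::-1] for word in text.split())


def layered_encode(text):
    """Layer 1: word substitution. Layer 2: per-word reversal."""
    return reverse_words(substitute_words(text, SUBSTITUTION))


def layered_decode(text):
    """Undo the layers in the opposite order they were applied."""
    return substitute_words(reverse_words(text), INVERSE)


def decryption_success_rate(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


plain = "the cat sat on the mat"
encoded = layered_encode(plain)
print(encoded)                  # eht revir detniap no eht duolc
print(layered_decode(encoded))  # the cat sat on the mat

# Two hypothetical model outputs: one exact, one with a single wrong word.
outputs = ["the cat sat on the mat", "the cat sat on a mat"]
print(decryption_success_rate(outputs, [plain, plain]))  # 0.5
```

Decoding has to undo the layers in reverse order, which is one reason a layered cipher asks more of a model than either cipher alone. The exact-match scorer counts any single-word slip as a failure; a softer metric would be a different design choice.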