When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers

Divij Handa; Zehua Zhang; Amir Saeidi; Shrinidhi Kumbhar; Md Nayem Uddin; Aswin RRV; Chitta Baral

TL;DR

LLM safety defenses can fail against novel, user-created ciphers as models’ reasoning improves, creating a paradox in which more capable models are easier to jailbreak. The authors propose Attack using Custom Encryptions (ACE) and Layered Attacks using Custom Encryptions (LACE), and introduce CipherBench to quantify LLMs’ decryption abilities. Empirical results across multiple open-source and proprietary models show that larger, reasoning-capable LLMs achieve higher decryption success on complex, layered ciphers, revealing a trade-off between safety and reasoning as well as instances of over-defensiveness. The work highlights the need for defenses that anticipate unknown, multi-layer cipher attacks and suggests directions for strengthening safety training beyond known ciphers.

Abstract

Recent advancements in Large Language Model (LLM) safety have primarily focused on mitigating attacks crafted in natural language or common ciphers (e.g. Base64), which are likely integrated into newer models' safety training. However, we reveal a paradoxical vulnerability: as LLMs advance in reasoning, they inadvertently become more susceptible to novel jailbreaking attacks. Enhanced reasoning enables LLMs to interpret complex instructions and decode complex user-defined ciphers, creating an exploitable security gap. To study this vulnerability, we introduce Attacks using Custom Encryptions (ACE), a jailbreaking technique that encodes malicious queries with novel ciphers. Extending ACE, we introduce Layered Attacks using Custom Encryptions (LACE), which applies multi-layer ciphers to amplify attack complexity. Furthermore, we develop CipherBench, a benchmark designed to evaluate LLMs' accuracy in decoding encrypted benign text. Our experiments reveal a critical trade-off: LLMs that are more capable of decoding ciphers are more vulnerable to LACE, with success rates on gpt-oss-20b escalating from 60% under ACE to 72% with LACE. These findings highlight a critical insight: as LLMs become more adept at deciphering complex user ciphers--many of which cannot be preemptively included in safety training--they become increasingly exploitable.
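The layering idea behind LACE can be illustrated on benign text, in the spirit of CipherBench. The following is a minimal sketch, not the authors' implementation: the helper names (`make_word_substitution`, `word_reversal`, `layer`), the word mapping, and the reading of "Word Reversal Cipher" as reversing the letters of each word are all assumptions made for illustration.

```python
from typing import Callable, Dict

Cipher = Callable[[str], str]


def make_word_substitution(mapping: Dict[str, str]) -> Cipher:
    """Return a cipher that swaps whole words using a user-defined mapping (hypothetical helper)."""
    def encode(text: str) -> str:
        # Words that are not in the mapping pass through unchanged.
        return " ".join(mapping.get(word, word) for word in text.split())
    return encode


def word_reversal(text: str) -> str:
    """Reverse the letters of each word; this reading of the cipher is an assumption."""
    return " ".join(word[::-1] for word in text.split())


def layer(*ciphers: Cipher) -> Cipher:
    """Compose ciphers so that each one is applied to the output of the previous one."""
    def encode(text: str) -> str:
        for cipher in ciphers:
            text = cipher(text)
        return text
    return encode


# Illustrative mapping on benign vocabulary; not taken from the paper.
mapping = {"weather": "piano", "sunny": "green"}
inverse = {v: k for k, v in mapping.items()}

# Layered encoding: word substitution first, then word reversal.
encode = layer(make_word_substitution(mapping), word_reversal)
# Decoding undoes the layers in the opposite order.
decode = layer(word_reversal, make_word_substitution(inverse))

plain = "the weather is sunny today"
cipher_text = encode(plain)
print(cipher_text)          # eht onaip si neerg yadot
print(decode(cipher_text))  # the weather is sunny today
```

Decoding the layered text requires undoing each layer in reverse order, which is why a model needs stronger multi-step reasoning to recover the plaintext than it does for a single cipher.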

Paper Structure

This paper contains 59 sections, 2 equations, 6 figures, and 7 tables.

Introduction
Related Works
Optimization-based Jailbreak Attacks
Prompt-based Jailbreak Attacks
Decryption Capabilities of LLMs
CipherBench: Evaluating Decoding Capabilities of LLMs
Benchmark Design
Ciphers
Instances
Models & Evaluation Metrics
Performance on CipherBench
Experimental Setup for Jailbreaking
Models
Dataset
Encryption Ciphers
...and 44 more sections

Figures (6)

Figure 1: Our first jailbreaking method, Attack using Custom Encryptions (ACE), uses novel user-designed ciphers to jailbreak LLMs, achieving a higher attack success rate (ASR) than both common and uncommon ciphers. Building upon ACE, Layered Attacks using Custom Encryptions (LACE) combines two encryption techniques, such as a Word Substitution Cipher followed by a Word Reversal Cipher. All ASRs displayed correspond to gpt-oss-20b.
Figure 2: Decryption Success Rate (DSR) of LLMs on CipherBench. Poor performance is observed when sentences are modified to contain random characters instead of English sentences. This indicates the models' reliance on recovering sentences from statistical patterns instead of understanding the decryption algorithm.
Figure 3: Decryption Success Rate (DSR) of LLMs across the 10 ciphers present in CipherBench.
Figure 4: Heatmap showing the ASR of ACE across all 14 categories using all uncommon or novel ciphers for GPT-4o.
Figure 5: Plot showing the number of instances that are over-defended across all five encryptions used in method ACE.
...and 1 more figure
