
Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

Viet-Anh Nguyen, Shiqian Zhao, Gia Dao, Runyi Hu, Yi Xie, Luu Anh Tuan

TL;DR

This work reveals that the stronger reasoning abilities of Large Reasoning Models can simultaneously improve performance and expose new vulnerabilities to adaptive jailbreaks. It introduces SEAL, an adaptive jailbreak that stacks multiple ciphers to obfuscate unsafe prompts and employs a gradient-bandit policy to dynamically pick cipher groups, balancing decryption feasibility with attack effectiveness. Through extensive experiments on real-world LRMs, SEAL achieves high attack success rates, including up to 100% on several models and strong transferability to unseen systems, significantly outperforming prior baselines. The findings underscore the urgent need for defenses that can withstand encryption-based, adaptive jailbreaks as reasoning capabilities continue to advance.

Abstract

Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies (random and adaptive) that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.

Paper Structure

This paper contains 37 sections, 6 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of recovery rate and ASR of stacked ciphers against Claude 3.7 Sonnet with and without thinking mode. Here, the recovery rate indicates the LRMs' ability to solve problems. The definition can be found in the setup subsection.
  • Figure 2: Overview of SEAL. In general, SEAL consistently modifies the adversarial prompt, with an adaptively sampled encryption algorithm set.
  • Figure 3: Performance of SEAL with random and adaptive strategies against different LRMs.
  • Figure 4: Comparison of ASR and recovery rate of SEAL using random strategy.
  • Figure 5: Example of original and new structure.
  • ...and 4 more figures