Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers
Viet-Anh Nguyen, Shiqian Zhao, Gia Dao, Runyi Hu, Yi Xie, Luu Anh Tuan
TL;DR
This work shows that the stronger reasoning abilities of Large Reasoning Models (LRMs) both improve performance and expose new vulnerabilities to adaptive jailbreaks. It introduces SEAL, an adaptive jailbreak that stacks multiple ciphers to obfuscate unsafe prompts and uses a gradient-bandit policy to pick cipher groups dynamically, balancing decryption feasibility against attack effectiveness. In extensive experiments on real-world LRMs, SEAL achieves high attack success rates, including up to 100% on several models, and transfers strongly to unseen systems, significantly outperforming prior baselines. The findings underscore the urgent need for defenses that can withstand encryption-based, adaptive jailbreaks as reasoning capabilities continue to advance.
Abstract
Large Reasoning Models (LRMs) have recently demonstrated stronger logical capabilities than traditional Large Language Models (LLMs) and have attracted significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies, random and adaptive, that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.
