ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, Chaowei Xiao
TL;DR
ARMOR addresses the vulnerability of safety-aligned LLMs to jailbreaks, especially out-of-distribution (OOD) attacks, with a three-step Meticulous Reasoning pipeline: it analyzes the jailbreak strategy against an external strategy library, extracts the core malicious intent, and applies policy-based safety analysis. A variant, ARMOR-Think, improves efficiency by separating safety reasoning from general reasoning and allowing free thinking on benign prompts. Empirically, ARMOR achieves state-of-the-art safety, with an average attack success rate (ASR) of $0.06$ against advanced jailbreaks and a harmful-output rate of $0.002$, and it generalizes to unseen strategies; ARMOR-Think improves utility and substantially shortens safety thinking. These results offer a practical path to deploying LLMs securely in adversarial settings, balancing safety with usefulness and extrapolating well to novel jailbreak methods.
Abstract
Large language models (LLMs) have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, in which models assess potential risks before producing responses. However, we find that these methods fail against powerful out-of-distribution (OOD) jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze jailbreak strategies using an external, updatable strategy library, (2) extract the core intent, and (3) apply policy-based safety verification. We further develop ARMOR-Think, which decouples safety reasoning from general reasoning to improve both robustness and utility. Evaluations on advanced optimization-based jailbreaks and safety benchmarks show that ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks, far below those of other reasoning-based models. Moreover, ARMOR generalizes strongly to unseen jailbreak strategies, reducing their success rate to zero. These results highlight ARMOR's effectiveness in defending against OOD jailbreak attacks, offering a practical path toward secure and reliable LLMs.
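
The control flow of the three-step pipeline described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: in ARMOR each step is carried out by the model's own reasoning, whereas here simple keyword matching stands in for it. All names (`STRATEGY_LIBRARY`, `POLICY`, `meticulous_reasoning`, and the helper functions) and all library and policy entries are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical strategy library: strategy name -> surface cues that suggest it.
# The abstract describes the library as external and updatable; these entries are invented.
STRATEGY_LIBRARY = {
    "role_play": ["pretend you are", "act as"],
    "fictional_framing": ["in a story", "for a novel"],
}

# Hypothetical safety policy: category -> phrases marking a disallowed core intent.
POLICY = {
    "weapons": ["build a bomb"],
    "malware": ["write ransomware"],
}


@dataclass
class Verdict:
    strategies: list   # step 1: strategies matched in the prompt
    core_intent: str   # step 2: prompt with the strategy wrapping removed
    violated: list     # step 3: policy categories the intent violates

    @property
    def allow(self) -> bool:
        return not self.violated


def detect_strategies(prompt, library):
    """Step 1 (toy): match the prompt against cues from the strategy library."""
    text = prompt.lower()
    return [name for name, cues in library.items() if any(c in text for c in cues)]


def extract_core_intent(prompt, strategies, library):
    """Step 2 (toy): remove the cues of detected strategies to expose the request."""
    text = prompt.lower()
    for name in strategies:
        for cue in library[name]:
            text = text.replace(cue, " ")
    return " ".join(text.split())


def check_policy(intent, policy):
    """Step 3 (toy): list the policy categories whose phrases occur in the intent."""
    return [cat for cat, terms in policy.items() if any(t in intent for t in terms)]


def meticulous_reasoning(prompt, library=STRATEGY_LIBRARY, policy=POLICY):
    strategies = detect_strategies(prompt, library)
    intent = extract_core_intent(prompt, strategies, library)
    return Verdict(strategies, intent, check_policy(intent, policy))
```

Because the library is passed in as data, a newly observed strategy can be added without touching the pipeline itself, for example by copying `STRATEGY_LIBRARY`, adding an entry, and passing the copy as the `library` argument. The sketch mirrors only this structural property; it says nothing about how ARMOR builds or updates its library in practice.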
