Table of Contents
Fetching ...

ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, Chaowei Xiao

TL;DR

ARMOR tackles the vulnerability of safety-aligned LLMs to jailbreaks, especially OOD attacks, by introducing a three-step Meticulous Reasoning pipeline that extracts the core malicious intent using an external strategy library and policy-based safety analysis. It further enhances efficiency with ARMOR-Think, separating safety reasoning from general reasoning and enabling free thinking for benign prompts. Empirical results show ARMOR achieves state-of-the-art safety, with an average ASR of $0.06$ against advanced jailbreaks and a harmful-output rate of $0.002$, while generalizing to unseen strategies; ARMOR-Think improves utility and cuts safety-thinking length substantially. These findings offer a practical pathway to secure LLM deployment in adversarial settings, balancing safety with usefulness and demonstrating strong extrapolation to novel jailbreak methods.

Abstract

Large Language Models have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as SFT and RLHF, reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, where models assess potential risks before producing responses. However, we find these methods fail against powerful out-of-distribution jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze jailbreak strategies from an external, updatable strategy library, (2) extract the core intent, and (3) apply policy-based safety verification. We further develop ARMOR-Think, which decouples safety reasoning from general reasoning to improve both robustness and utility. Evaluations on advanced optimization-based jailbreaks and safety benchmarks show that ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks, far below other reasoning-based models. Moreover, ARMOR demonstrates strong generalization to unseen jailbreak strategies, reducing their success rate to zero. These highlight ARMOR's effectiveness in defending against OOD jailbreak attacks, offering a practical path toward secure and reliable LLMs.

ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

TL;DR

ARMOR tackles the vulnerability of safety-aligned LLMs to jailbreaks, especially OOD attacks, by introducing a three-step Meticulous Reasoning pipeline that extracts the core malicious intent using an external strategy library and policy-based safety analysis. It further enhances efficiency with ARMOR-Think, separating safety reasoning from general reasoning and enabling free thinking for benign prompts. Empirical results show ARMOR achieves state-of-the-art safety, with an average ASR of against advanced jailbreaks and a harmful-output rate of , while generalizing to unseen strategies; ARMOR-Think improves utility and cuts safety-thinking length substantially. These findings offer a practical pathway to secure LLM deployment in adversarial settings, balancing safety with usefulness and demonstrating strong extrapolation to novel jailbreak methods.

Abstract

Large Language Models have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as SFT and RLHF, reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, where models assess potential risks before producing responses. However, we find these methods fail against powerful out-of-distribution jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze jailbreak strategies from an external, updatable strategy library, (2) extract the core intent, and (3) apply policy-based safety verification. We further develop ARMOR-Think, which decouples safety reasoning from general reasoning to improve both robustness and utility. Evaluations on advanced optimization-based jailbreaks and safety benchmarks show that ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks, far below other reasoning-based models. Moreover, ARMOR demonstrates strong generalization to unseen jailbreak strategies, reducing their success rate to zero. These highlight ARMOR's effectiveness in defending against OOD jailbreak attacks, offering a practical path toward secure and reliable LLMs.

Paper Structure

This paper contains 28 sections, 9 equations, 21 figures, 21 tables.

Figures (21)

  • Figure 1: ASR of Adversarial Reasoning against models.
  • Figure 2: Utility results on general benchmarks of ARMOR and the base model.
  • Figure 3: Reasoning-based safety-aligned LLMs mislead by the advanced optimization-based jailbreak prompt and falsely catch the intent, resulting in a misaligned output. In contrast, ARMOR extracts the core intent of the instruction with a jailbreak strategy analysis, along with a policy-based safety analysis, demonstrating robustness to advanced optimization-based jailbreak attacks.
  • Figure 3: Ablation study on the strategy analysis step. The model w/o strategy analysis is trained with data that does not contain the strategy analysis step.
  • Figure 4: The framework of ARMOR consists of the following steps: (1) Construct the Meticulous Reasoning steps with jailbreak prompts, their coordinate ground truth (GT) jailbreak strategy and intent, and the safety policy; (2) Format the reasoning steps with inputs involving the user's prompts and the system prompt consists of a dynamic strategy library and the safety policy; (3) Train the base model to get the ARMOR model; (4) Conduct inference of ARMOR with a custom strategy library and the safety policy; (5) Conduct test-time scaling with the DPO model and PRM trained on preference data generated from grounded tree sampling.
  • ...and 16 more figures