Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

TL;DR

This work tackles data bottlenecks in safety reasoning for LLMs by introducing AIDSAFE, a multi-agent deliberation framework that iteratively expands and refines safety-relevant CoTs. By coupling five explicit safety policies with an initialization, deliberation, and refinement pipeline, it generates policy-embedded CoT data that improves policy adherence and jailbreak robustness when used for supervised fine-tuning of open-source LLMs. It further proposes an ear-whisperer mechanism to create distinct selected and rejected CoTs for Direct Preference Optimization (DPO), enabling more effective alignment. Across extensive evaluations, AIDSAFE demonstrates stronger safety generalization with manageable utility costs, and the CoT datasets are released to support ongoing research in safe AI systems.

Abstract

Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
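
To make the deliberation and refinement stages described in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each agent is a plain callable that reads the query, the policies, and the thoughts so far, and returns one new thought (or an empty string). The names `deliberate`, `refine`, and `is_acceptable` are hypothetical, and the refiner here is reduced to duplicate removal plus a caller-supplied filter, which only approximates the removal of repetitive, redundant, and deceptive thoughts.

```python
from typing import Callable, List

# Hypothetical agent signature: (query, policies, thoughts_so_far) -> new thought.
Agent = Callable[[str, List[str], List[str]], str]


def deliberate(query: str, policies: List[str], agents: List[Agent],
               max_rounds: int = 3) -> List[str]:
    """Let agents take turns extending a shared list of thoughts.

    Stops early when a full round adds nothing new (an assumed stopping rule).
    """
    thoughts: List[str] = []
    for _ in range(max_rounds):
        added = False
        for agent in agents:
            thought = agent(query, policies, thoughts)
            if thought and thought not in thoughts:
                thoughts.append(thought)
                added = True
        if not added:
            break
    return thoughts


def refine(thoughts: List[str],
           is_acceptable: Callable[[str], bool]) -> List[str]:
    """Drop near-verbatim repeats and thoughts rejected by the filter."""
    seen = set()
    kept: List[str] = []
    for thought in thoughts:
        key = thought.strip().lower()  # crude duplicate key (assumption)
        if key in seen or not is_acceptable(thought):
            continue
        seen.add(key)
        kept.append(thought)
    return kept
```

In practice each agent and the `is_acceptable` filter would presumably be backed by LLM calls; plain functions are used here only so the control flow is easy to follow.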

Paper Structure

This paper contains 58 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Proposed Multi-agent Deliberation Framework to Generate Safety-embedded CoTs
  • Figure 2: Pairwise comparison of AIDSAFE and LLM$_{ZS}$-generated CoTs, evaluated by Claude-3 Sonnet and Command. The bars show the proportion of AIDSAFE wins (green), ties (gray), and LLM$_{ZS}$ wins (orange).
  • Figure 3: Comparison of model performance in terms of safety level and over-refusal accuracy. Higher safety levels and higher over-refusal accuracy are desirable.
  • Figure 4: Preference Data Quality - faithfulness measures to understand the policy adherence of the selected and rejected CoT data.
  • Figure 5: Preference Data Creation
  • ...and 1 more figure