Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu

TL;DR

This work constructs and releases a novel Chain-of-Thought fine-tuning dataset and introduces Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. Across multiple safety and utility benchmarks, the method consistently improves alignment robustness while maintaining overall model utility.

Abstract

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
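
For orientation, the standard DPO objective that the abstract refers to as "vanilla DPO" is written out below, followed by one hypothetical way a segment-weighted variant could be expressed. The second expression is an illustrative assumption only: the weights $w_{rs}$, $w_{rp}$, the per-segment log-ratios $r_\theta^{rs}$, $r_\theta^{rp}$, and the notation $y_w$, $y_l$ for the chosen and rejected outputs are introduced here and are not the paper's definition of $L_{\text{AW-DPO}}$.

```latex
% Standard DPO loss for a prompt x, chosen output y_w and rejected output y_l,
% with policy \pi_\theta, frozen reference \pi_{\mathrm{ref}}, and temperature \beta.
\[
L_{\text{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \Big[ \log \sigma\big( \beta\, r_\theta(x, y_w) - \beta\, r_\theta(x, y_l) \big) \Big],
\qquad
r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} .
\]

% Because sequence log-probabilities are sums over tokens, the log-ratio splits
% exactly into a reasoning part and a final-answer part:
\[
r_\theta(x, y) = r_\theta^{rs}(x, y) + r_\theta^{rp}(x, y).
\]

% Hypothetical segment-weighted variant (an assumption, not the paper's formula):
\[
\tilde{r}_\theta(x, y) = w_{rs}\, r_\theta^{rs}(x, y) + w_{rp}\, r_\theta^{rp}(x, y),
\qquad
L_{\text{weighted}}
  = -\,\mathbb{E}
    \Big[ \log \sigma\big( \beta\, \tilde{r}_\theta(x, y_w) - \beta\, \tilde{r}_\theta(x, y_l) \big) \Big].
\]
```

Under this sketch, setting $w_{rs} = w_{rp} = 1$ recovers the standard DPO loss, while unequal weights shift the update toward whichever segment is weighted more heavily.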

Paper Structure (39 sections, 4 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: Heatmap of Probing Accuracy for Original and Pruned Llama-2-7b-Chat and Mistral-7B-Instruct-v0.3 on Alignment and Reasoning Tasks.
  • Figure 2: AW-DPO Pipeline. Step 1: Generate $k$ candidate responses per prompt using the CoT-fine-tuned LLM, and score their harmfulness on (i) reasoning ($h_{rs}$), (ii) response ($h_{rp}$), and (iii) full answer ($h_f$) using a judge model. Step 2: Select preference pairs $(x_{\text{chosen}}, x_{\text{rejected}})$ where the full harmfulness score difference exceeds threshold $\gamma$. Step 3: Compute alignment weights and train using $L_{\text{AW-DPO}}$.
  • Figure 3: Plot (a) shows the distribution within unsafe full responses. Plots (b) and (c) present the average safety and utility performance, compared to the corresponding open-source aligned models.
  • Figure 4: Plot (a) shows the performance improvements of our method on aligned chat models. Plots (b) and (c) show the ablation study comparing the safety and utility performance of standard DPO versus AW-DPO, respectively.
  • Figure 5: t-SNE Visualization of Embeddings from Attention Head 5 for Alignment and Reasoning Tasks Across All Layers.
  • ...and 2 more figures
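
Step 2 of the pipeline in the Figure 2 caption (selecting preference pairs whose full-answer harmfulness scores differ by more than $\gamma$) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` class and `select_pairs` function are hypothetical names, the judge scores are supplied as plain numbers on an arbitrary scale, and the less harmful candidate is assumed to be the chosen one. How the alignment weights in Step 3 are computed is not stated in this summary, so it is not sketched here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One sampled response with judge scores (hypothetical container)."""
    text: str
    h_rs: float  # harmfulness of the reasoning segment
    h_rp: float  # harmfulness of the final response segment
    h_f: float   # harmfulness of the full answer


def select_pairs(candidates, gamma):
    """Return (chosen, rejected) pairs whose h_f gap exceeds gamma.

    Assumption: the candidate with the lower full-answer harmfulness
    score is treated as the chosen response.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        if abs(a.h_f - b.h_f) > gamma:
            chosen, rejected = (a, b) if a.h_f < b.h_f else (b, a)
            pairs.append((chosen, rejected))
    return pairs


# Toy example with k = 3 candidates for a single prompt (made-up scores).
candidates = [
    Candidate("safe refusal with rationale", h_rs=0.1, h_rp=0.1, h_f=0.1),
    Candidate("hedged partial answer", h_rs=0.3, h_rp=0.6, h_f=0.5),
    Candidate("harmful compliance", h_rs=0.7, h_rp=0.9, h_f=0.9),
]
pairs = select_pairs(candidates, gamma=0.5)
for chosen, rejected in pairs:
    print(f"chosen={chosen.text!r}  rejected={rejected.text!r}")
```

With a threshold of 0.5, only the pair with the largest gap survives in this toy example; a smaller $\gamma$ would admit more, but noisier, pairs. The per-segment scores `h_rs` and `h_rp` are carried along so that a later weighting step could use them.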