Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu; Vivek V. Datla; Anoop Kumar; Zihan Guan; Sheng Li; Alfy Samuel; Daben Liu

TL;DR

This work constructs and releases a novel Chain-of-Thought fine-tuning dataset and introduces Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. The method consistently improves alignment robustness while maintaining overall model utility.

Abstract

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.

Paper Structure (39 sections, 4 equations, 7 figures, 13 tables)

Introduction
Related Work
LLM safety mechanism
LLM Post-training
Reasoning of Large Language Models.
Reinforcement Learning from Human Feedback and Direct Preference Optimization.
Preliminary Experiments
Method: Teaching Models Why to Say No with Alignment-Weighted DPO
Performance and Error Patterns.
Alignment-Weighted DPO.
Formulation.
Experiments
Baselines & Datasets
Main Result
Comparison with reasoning LLMs
...and 24 more sections

Figures (7)

Figure 1: Heatmap of Probing Accuracy for Original and Pruned Llama-2-7b-Chat and Mistral-7B-Instruct-v0.3 on Alignment and Reasoning Tasks.
Figure 2: AW-DPO Pipeline. Step 1: Generate $k$ candidate responses per prompt using the CoT-fine-tuned LLM, and score their harmfulness on (i) reasoning ($h_{rs}$), (ii) response ($h_{rp}$), and (iii) full answer ($h_f$) using a judge model. Step 2: Select preference pairs $(x_{\text{chosen}}, x_{\text{rejected}})$ where the full harmfulness score difference exceeds threshold $\gamma$. Step 3: Compute alignment weights and train using $L_{\text{AW-DPO}}$.
Figure 3: Plot (a) shows the distribution within unsafe full responses. Plots (b) and (c) present the average safety and utility performance, compared to the corresponding open-source aligned models.
Figure 4: Plot (a) shows the performance improvements of our method on aligned chat models. Plots (b) and (c) show the ablation study comparing the safety and utility performance of standard DPO versus AW-DPO, respectively.
Figure 5: t-SNE Visualization of Embeddings from Attention Head 5 for Alignment and Reasoning Tasks Across All Layers.
...and 2 more figures
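
The pair-selection step in the Figure 2 caption (Step 2) can be illustrated with a minimal sketch. It is not the authors' implementation. The `Candidate` class, the `select_preference_pairs` function, the [0, 1] score range, the rule that the less harmful candidate becomes the chosen one, and the enumeration of all candidate pairs are assumptions made for illustration. Only the score names ($h_{rs}$, $h_{rp}$, $h_f$) and the threshold $\gamma$ come from the caption.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One sampled response with judge-assigned harmfulness scores (hypothetical container)."""
    text: str
    h_rs: float  # harmfulness of the reasoning segment
    h_rp: float  # harmfulness of the response (final-answer) segment
    h_f: float   # harmfulness of the full answer


def select_preference_pairs(candidates, gamma):
    """Return (chosen, rejected) pairs whose full-answer harmfulness gap exceeds gamma.

    Assumption: the candidate with the lower h_f is treated as the chosen one.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        chosen, rejected = (a, b) if a.h_f <= b.h_f else (b, a)
        if rejected.h_f - chosen.h_f > gamma:
            pairs.append((chosen, rejected))
    return pairs


# Illustrative scores only; these are not results from the paper.
candidates = [
    Candidate("refusal with rationale", h_rs=0.1, h_rp=0.0, h_f=0.1),
    Candidate("partial compliance", h_rs=0.4, h_rp=0.6, h_f=0.5),
    Candidate("harmful answer", h_rs=0.7, h_rp=0.9, h_f=0.9),
]
pairs = select_preference_pairs(candidates, gamma=0.5)
for chosen, rejected in pairs:
    print(f"chosen={chosen.text!r} rejected={rejected.text!r}")
```

With these made-up scores, only the pair whose $h_f$ gap is larger than $\gamma = 0.5$ is kept. The segment-level scores are carried along so that a later step could derive alignment weights from them; how those weights are computed is not specified in this excerpt and is left out of the sketch.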
