Direct Alignment of Language Models via Quality-Aware Self-Refinement

Runsheng Yu; Yong Wang; Xiaoqi Jiao; Youzhi Zhang; James T. Kwok

TL;DR

This work addresses a limitation of Direct Preference Optimization (DPO): it ignores the relative quality of the positive and negative responses, which can lead to sub-optimal training. The authors add a self-refinement mechanism that draws on the intrinsic knowledge of the LLM being fine-tuned. A refinement function $\Delta_\pi$, implemented via prompting, gauges the relative quality of the positive and negative responses and is integrated into DPO (Sr-DPO) and IPO (Sr-IPO) to self-adjust the loss. The authors formalize the properties of the refinement function, show how $\Delta_\pi$ can be computed through prompt augmentation with $p\oplus x$, and report that Sr-DPO and Sr-IPO improve evaluation metrics on MT-Bench, Vicuna-Bench, and the Open LLM Leaderboard, with higher correlation to GPT-4 scores and acceptable training time. The approach exploits the policy’s own knowledge rather than an external reward model, and may apply more broadly to preference-based fine-tuning.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of Large Language Models (LLMs) with human preferences. Recently, a popular alternative is Direct Preference Optimization (DPO), which replaces an LLM-based reward model with the policy itself, thus obviating the need for extra memory and training time to learn the reward model. However, DPO does not consider the relative qualities of the positive and negative responses, and can lead to sub-optimal training outcomes. To alleviate this problem, we investigate the use of intrinsic knowledge within the LLM being fine-tuned to obtain relative qualities and help refine the loss function. Specifically, we leverage the knowledge of the LLM to design a refinement function that estimates the quality of both the positive and negative responses. We show that the constructed refinement function can help self-refine the loss function under mild assumptions. The refinement function is integrated into DPO and its variant Identity Policy Optimization (IPO). Experiments across various evaluators indicate that the resulting methods improve the performance of the fine-tuned models over DPO and IPO.
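To make the idea of refining the loss concrete, the sketch below computes the standard per-example DPO loss and a refined variant that shifts the implicit reward margin by a quality-gap estimate. The exact Sr-DPO objective is not reproduced on this page, so the way `delta` enters the loss, its sign convention, and the function names `dpo_loss` and `refined_dpo_loss` are assumptions for illustration only; `lam` plays the role of the hyperparameter $\lambda$ mentioned in the figure captions, and `delta` stands in for a precomputed value of $\Delta_\pi$ treated as a constant.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard per-example DPO loss from sequence log-probabilities.

    logp_*     : log pi(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference model
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -log_sigmoid(margin)


def refined_dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg,
                     delta, lam=1.0, beta=0.1):
    """Hypothetical quality-aware variant (an assumption, not the paper's
    confirmed formula): the implicit reward margin is shifted by
    lam * delta, where delta is a precomputed quality-gap estimate that
    receives no gradient.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -log_sigmoid(margin - lam * delta)


if __name__ == "__main__":
    args = (-1.0, -2.0, -1.5, -1.5)  # made-up log-probabilities
    print("DPO loss:    ", dpo_loss(*args))
    print("Refined loss:", refined_dpo_loss(*args, delta=0.5))
```

Under this assumed form, setting `delta=0` or `lam=0` recovers plain DPO, so the refinement acts purely as a per-example adjustment to the margin the policy is asked to achieve.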

Paper Structure (27 sections, 3 theorems, 30 equations, 3 figures, 9 tables, 2 algorithms)

Introduction
Preliminaries
Classical RLHF with Bradley-Terry Reward Model
Direct Preference Optimization (DPO)
Identity Policy Optimisation (IPO)
Self-Alignment
Proposed Method
Limitation of Maximizing Bradley-Terry Preference
Refining the Reward Difference between Positive and Negative Responses
Intuition
Implementing $\Delta$ via Prompting
Integration with DPO and IPO
Integration with DPO
Integration with IPO
Experiments
...and 12 more sections

Key Result

Proposition 3.3

With Assumptions assu:improve and assu:consistency, we have (i) $\Delta_\pi(y^-, y^+;x)= \Delta_\pi(y^-, y^*;x) - \Delta_\pi(y^+, y^*;x)$, where $y^*$ is the optimal $y$ for the given $x$; (ii) For any tuple $(x,y^+,y^-)$, $r^*(y^+|x) >r^*(y^-|x) \Leftrightarrow \Delta_\pi(y^+, y^*;x) < \Delta_\pi(y

Figures (3)

Figure 1: Average margin and accuracy on the HH-RLHF dataset with different numbers of training tuples.
Figure 2: Win rates of Sr-DPO (vs DPO) and Sr-IPO (vs IPO) w.r.t. $\lambda$ and number of training tuples on Vicuna-Bench.
Figure 3: Performance on the Open LLM Leaderboard with different numbers of training tuples.

Theorems & Definitions (5)

Proposition 3.3
Corollary 3.3.1
Proposition 3.4
proof
proof
