Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs
Víctor Gallego
TL;DR
This work introduces refined Direct Preference Optimization (rDPO), a framework for aligning LLMs without human-annotated data: a teacher LLM synthesizes the data, an external reward model scores it, and an augmented DPO loss distills the result into a student model. By splitting the guidance signal between a teacher and an external evaluator, rDPO achieves improved sample efficiency and robustness to noise in the synthetic data across three tasks: safety, robustness to role-playing, and sycophancy reduction. The method augments the DPO objective with external scores to form a refined loss $\nabla L_{rDPO}$, and demonstrates empirical gains over SFT, SR, dSC, and vanilla DPO. The approach reduces reliance on costly human data while maintaining effective behavioral alignment, with code forthcoming at the author's GitHub repository.
Abstract
In this paper, we introduce refined Direct Preference Optimization (rDPO), a method for improving the behavioral alignment of Large Language Models (LLMs) without the need for human-annotated data. The method creates synthetic data through self-critique prompting of a teacher LLM and then uses a generalized DPO loss function to distill it into a student LLM. The loss function incorporates an additional external reward model to improve the quality of the synthetic data, making rDPO robust to potential noise in the synthetic dataset. rDPO is shown to be effective on a diverse set of behavioral alignment tasks: improved safety, robustness against role-playing, and reduced sycophancy. Code to be released at https://github.com/vicgalle/refined-dpo.
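To make the idea of adding an external score to a DPO-style loss concrete, the following minimal sketch first computes the standard DPO loss for a single preference pair, then a score-weighted variant. The variant is an illustrative assumption, not the rDPO loss defined in the paper: it turns the gap between two reward-model scores into a soft label and mixes the loss for both preference directions. The names `score_weighted_dpo_loss`, `score_chosen`, `score_rejected`, and `alpha` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair.

    Inputs are sequence log-probabilities under the policy (pi_*)
    and under the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -log_sigmoid(margin)


def score_weighted_dpo_loss(pi_chosen: float, pi_rejected: float,
                            ref_chosen: float, ref_rejected: float,
                            score_chosen: float, score_rejected: float,
                            alpha: float = 1.0, beta: float = 0.1) -> float:
    """Hypothetical score-weighted variant (assumption, not the paper's loss).

    The external reward-model scores become a soft label
    w = sigmoid(alpha * (score_chosen - score_rejected)); the loss mixes
    both preference directions with weights w and 1 - w, so a pair the
    reward model considers ambiguous or mislabeled contributes a weaker
    or reversed training signal.
    """
    w = math.exp(log_sigmoid(alpha * (score_chosen - score_rejected)))
    forward = dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta)
    backward = dpo_loss(pi_rejected, pi_chosen, ref_rejected, ref_chosen, beta)
    return w * forward + (1.0 - w) * backward
```

Under this assumption, a large score gap in favor of the chosen response recovers the standard DPO loss, while equal scores give both directions the same weight, which limits the influence of noisy synthetic pairs.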
