Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs

Víctor Gallego

TL;DR

This work introduces refined Direct Preference Optimization (rDPO), a framework for aligning LLMs without human-annotated data: a teacher LLM synthesizes the data, an external reward model scores it, and the result is distilled into a student LLM with an augmented DPO loss. By splitting the guidance signal between a teacher and an external evaluator, rDPO achieves improved sample efficiency and robustness to noise in the synthetic data across three tasks: improved safety, robustness against role-playing, and reduced sycophancy. The method augments the DPO objective with external scores to form a refined loss $L_{rDPO}$, and demonstrates empirical gains over SFT, SR, dSC, and vanilla DPO. The approach reduces reliance on costly human data while maintaining effective behavioral alignment, with code forthcoming at the author's GitHub repository.

Abstract

In this paper, we introduce refined Direct Preference Optimization (rDPO), a method for improving the behavioral alignment of Large Language Models (LLMs) without the need for human-annotated data. The method involves creating synthetic data using self-critique prompting by a teacher LLM and then utilizing a generalized DPO loss function to distill to a student LLM. The loss function incorporates an additional external reward model to improve the quality of synthetic data, making rDPO robust to potential noise in the synthetic dataset. rDPO is shown to be effective in a diverse set of behavioral alignment tasks, such as improved safety, robustness against role-playing, and reduced sycophancy. Code will be released at https://github.com/vicgalle/refined-dpo.
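To make the idea of a DPO loss informed by an external reward model more concrete, here is a minimal sketch. The `dpo_loss` function is the standard per-example DPO objective. The `reward_oriented_dpo_loss` function is an assumption made purely for illustration: it uses the external scores to decide which of the two responses is treated as preferred, so a noisy synthetic pair does not push the student in the wrong direction. It is not the paper's exact rDPO formula, and all function and parameter names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard per-example DPO loss from sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


def reward_oriented_dpo_loss(logp_revised: float, logp_original: float,
                             ref_logp_revised: float, ref_logp_original: float,
                             score_revised: float, score_original: float,
                             beta: float = 0.1) -> float:
    """Illustrative variant (an assumption, not the paper's formula).

    The external reward scores decide which response plays the 'chosen'
    role, instead of always trusting the teacher's revised response.
    """
    if score_revised >= score_original:
        return dpo_loss(logp_revised, logp_original,
                        ref_logp_revised, ref_logp_original, beta)
    # The external evaluator prefers the original response: swap the roles.
    return dpo_loss(logp_original, logp_revised,
                    ref_logp_original, ref_logp_revised, beta)


# Example: the teacher's revision is scored lower, so the pair is flipped.
loss = reward_oriented_dpo_loss(
    logp_revised=-12.0, logp_original=-10.0,
    ref_logp_revised=-11.0, ref_logp_original=-11.0,
    score_revised=0.2, score_original=0.9,
)
```

In practice the log-probabilities would come from the student and a frozen reference model, and the scores from whichever external reward model is used; here they are plain floats so the sketch stays self-contained.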

Paper Structure (21 sections, 2 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Illustrative diagram of the rDPO framework.
  • Figure 2: Prompt example for the robustness against role-playing task.