GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets

Oh Joon Kwon, Daiki E. Matsunaga, Kee-Eung Kim

TL;DR

This work proposes a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting to curtail the challenges of Direct Preference Optimization (DPO), namely overfitting to reward signals and reproducing human biases present in the dataset.

Abstract

A critical component of the current generation of language models is preference alignment, which aims to precisely control the model's behavior to meet human needs and values. The most notable among such methods are Reinforcement Learning with Human Feedback (RLHF) and its offline variant Direct Preference Optimization (DPO), both of which seek to maximize a reward model based on human preferences. In particular, DPO derives reward signals directly from the offline preference data, but in doing so it overfits the reward signals and generates suboptimal responses that may contain human biases present in the dataset. In this work, we propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting to curtail such challenges. Empirical results on dialog generation and summarization tasks show that GDPO generates far more diverse responses than the baseline methods while remaining relatively aligned with human values.
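
To make concrete what "derives reward signals directly from the offline preference data" means, the following is a minimal sketch of the standard DPO loss for a single preference pair. It illustrates the DPO baseline only, not the GDPO objective proposed in this paper. The function names, argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the inputs are assumed to be sequence-level log-probabilities already computed under the policy and a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All four inputs are sequence-level log-probabilities (assumed to be
    precomputed); beta is an illustrative default, not a value from the paper.
    """
    # Implicit reward of each response: beta * log(pi(y|x) / pi_ref(y|x)).
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss decreases monotonically as the implicit reward margin grows, so nothing in this per-pair objective by itself limits how far the policy is pushed toward the preferred responses in the dataset. This is one way to read the overfitting concern raised in the abstract.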

Paper Structure

This paper contains 37 sections, 8 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Scatter plot of win percentage versus diversity for the Anthropic HH dataset at sampling temperature 1.0. Refer to the first figure for legends. The horizontal bars show the standard error of the win rate. We omit error bars for diversity because the error is negligible and similar across methods.
  • Figure 2: Scatter plot of win percentage versus diversity for the TLDR dataset at sampling temperature 1.0. Refer to the first figure for legends. The horizontal bars show the standard error of the win rate. We omit error bars for diversity because the error is negligible and similar across methods. Win rates are reported for two different GPT-4 evaluation prompts, namely simple (S) and concise (C).
  • Figure : Pseudocode for the detailed balance (DB) objective.