Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Authors
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Abstract
While large-scale unsupervised language models (LMs) learn broad world
knowledge and some reasoning skills, achieving precise control of their
behavior is difficult due to the completely unsupervised nature of their
training. Existing methods for gaining such steerability collect human labels
of the relative quality of model generations and fine-tune the unsupervised LM
to align with these preferences, often with reinforcement learning from human
feedback (RLHF). However, RLHF is a complex and often unstable procedure, first
fitting a reward model that reflects the human preferences, and then
fine-tuning the large unsupervised LM using reinforcement learning to maximize
this estimated reward without drifting too far from the original model. In this
paper we introduce a new parameterization of the reward model in RLHF that
enables extraction of the corresponding optimal policy in closed form, allowing
us to solve the standard RLHF problem with only a simple classification loss.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is
stable, performant, and computationally lightweight, eliminating the need for
sampling from the LM during fine-tuning or performing significant
hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align
with human preferences as well as or better than existing methods. Notably,
fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of
generations, and matches or improves response quality in summarization and
single-turn dialogue while being substantially simpler to implement and train.