2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

Shilong Li, Yancheng He, Hui Huang, Xingyuan Bu, Jiaheng Liu, Hangyu Guo, Weixun Wang, Jihao Gu, Wenbo Su, Bo Zheng

TL;DR

This work extends the preference signal of DPO to two dimensions, segments and aspects, and develops a 2D-DPO framework that decomposes the overall objective into multi-segment and multi-aspect objectives.

Abstract

Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference signal of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we design several criteria that cover response-quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
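To make the two dimensions concrete, the following is a minimal sketch and not the paper's implementation. It assumes a response is split into sentences with a simple regular expression, that each segment carries one score per aspect, and that aspect scores are combined by a weighted mean into a per-segment weight applied to a DPO-style margin. The aspect names, the weighting scheme, the value of `beta`, and all function names are hypothetical placeholders; the abstract does not specify them.

```python
import math
import re

# Placeholder aspect names for illustration only (not the paper's criteria).
ASPECTS = ("helpfulness", "correctness")


def split_segments(response: str) -> list[str]:
    """Split a response into sentence-level segments (naive regex splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def segment_weights(scores: list[dict[str, float]],
                    aspect_weights: dict[str, float]) -> list[float]:
    """Collapse a segments x aspects score grid into one weight per segment.

    Assumption: a weighted mean over aspects; other aggregations are possible.
    """
    total = sum(aspect_weights.values())
    return [
        sum(aspect_weights[a] * seg[a] for a in aspect_weights) / total
        for seg in scores
    ]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def segment_weighted_dpo_loss(chosen_logratios: list[float],
                              chosen_w: list[float],
                              rejected_logratios: list[float],
                              rejected_w: list[float],
                              beta: float = 0.1) -> float:
    """DPO-style loss where each segment's policy/reference log-ratio is
    scaled by its segment weight before forming the preference margin.

    Assumption: this is one plausible decomposition, not the paper's objective.
    """
    chosen = sum(w * r for w, r in zip(chosen_w, chosen_logratios))
    rejected = sum(w * r for w, r in zip(rejected_w, rejected_logratios))
    return -log_sigmoid(beta * (chosen - rejected))


if __name__ == "__main__":
    segments = split_segments("Paris is the capital. It is in France!")
    grid = [{"helpfulness": 4.0, "correctness": 2.0},
            {"helpfulness": 3.0, "correctness": 3.0}]
    weights = segment_weights(grid, {a: 1.0 for a in ASPECTS})
    print(segments, weights)
```

In this sketch, a scalar-preference baseline corresponds to giving every segment the same weight, which is why the segment and aspect scores only matter once they differ across segments.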

Paper Structure

This paper contains 43 sections, 25 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: An illustrative comparison between vanilla DPO and 2D-DPO.
  • Figure 2: Illustration of our proposed 2D-DPO. First, we develop principles for preference annotation on different aspects, and collect scores across different segments and aspects for paired responses, leading to 2-dimensional signals. Second, we apply 2D-DPO on the constructed signals with a decomposed training objective.
  • Figure 3: The relative performance on different aspects of different alignment methods.
  • Figure 4: The trends in reward scores and accuracy over training steps across DPO, TDPO, 1D-DPO, and 2D-DPO. (a) Rewards of preferred (solid lines) and dispreferred (dashed lines) responses. (b) Reward accuracy compared with preference annotation.
  • Figure 5: The trends in sequential KL divergence between the policy model and the reference model over training steps across DPO, TDPO, 1D-DPO, and 2D-DPO. (a) KL divergence for preferred responses. (b) KL divergence for dispreferred responses.
  • ...and 10 more figures

Theorems & Definitions (1)

  • Definition 1