OPTune: Efficient Online Preference Tuning

Lichang Chen, Jiuhai Chen, Chenxi Liu, John Kirchenbauer, Davit Soselia, Chen Zhu, Tom Goldstein, Tianyi Zhou, Heng Huang

TL;DR

A more efficient data exploration strategy for online preference tuning (OPTune) is proposed, which does not rely on human-curated or pre-collected teacher responses but dynamically samples informative responses for on-policy preference alignment.

Abstract

Reinforcement learning with human feedback (RLHF) is critical for aligning Large Language Models (LLMs) with human preference. Compared to the widely studied offline version of RLHF, e.g., direct preference optimization (DPO), recent works have shown that the online variants achieve even better alignment. However, online alignment requires on-the-fly generation of new training data, which is costly, hard to parallelize, and suffers from varying quality and utility. In this paper, we propose a more efficient data exploration strategy for online preference tuning (OPTune), which does not rely on human-curated or pre-collected teacher responses but dynamically samples informative responses for on-policy preference alignment. During data generation, OPTune only selects prompts whose (re)generated responses can potentially provide more informative and higher-quality training signals than the existing responses. In the training objective, OPTune reweights each generated response (pair) by its utility in improving the alignment so that learning can be focused on the most helpful samples. Throughout our evaluations, OPTune'd LLMs maintain the instruction-following benefits provided by standard preference tuning whilst enjoying 1.27-1.56x faster training speed due to the efficient data exploration strategy.
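
The reweighting idea in the abstract can be illustrated with a minimal sketch of a per-pair weighted DPO loss. This is not the authors' implementation: the `PreferencePair` container, the `weighted_dpo_loss` name, the `beta` default, and the way the weight is supplied are all assumptions made for illustration. This section does not state how the utility weight is computed, so the sketch takes it as an input. With every weight set to 1 it reduces to the standard DPO loss.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Sequence log-probabilities under the current policy and the frozen reference model.
    policy_chosen: float
    policy_rejected: float
    ref_chosen: float
    ref_rejected: float
    # Hypothetical utility weight; how it is derived is NOT specified in this section.
    weight: float = 1.0


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def weighted_dpo_loss(pairs, beta: float = 0.1) -> float:
    """Mean of weight * standard DPO loss over the given pairs.

    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    total = 0.0
    for p in pairs:
        # Implicit reward margin between the chosen and the rejected response.
        margin = (p.policy_chosen - p.ref_chosen) - (p.policy_rejected - p.ref_rejected)
        total += p.weight * -log_sigmoid(beta * margin)
    return total / len(pairs)
```

Taking the weight as a plain per-pair number keeps the sketch independent of any particular utility definition, so a reward gap or any other score could be plugged in.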

Paper Structure (35 sections, 7 equations, 7 figures, 6 tables, 2 algorithms)

Figures (7)

  • Figure 1: The OPTune pipeline. It re-generates responses only for the low-reward examples and reuses the existing high-quality ones, which improves the generation efficiency of iterative online preference tuning (PT). Weighted DPO further improves training efficiency by focusing on the high-utility samples. $\pi_t$: the policy at iteration $t$. $R$: the reward model. $\rho$: the ratio of prompts selected for re-generation.
  • Figure 2: The reward gains brought by two subsets: top-50% ranked prompts and bottom-50% ranked prompts. More gains are achieved from the bottom-50% prompts than the top-50% prompts.
  • Figure 3: OPTune (wDPO loss): The y-axis shows the win score against the Zephyr-7B-beta model. Rdm_$\rho$: random selection at ratio $\rho$ (all striped bars). Under the same selection ratio, OPTune'd models perform better than models tuned with the random-selection strategy. Policies trained with prompt selection ratios $\rho=0.5$ and $\rho=0.7$ are comparable to those trained with $\rho=1$ while saving 50% and 30% of the generation cost, respectively, which supports the effectiveness of OPTune.
  • Figure 4: OPTune (DPO loss): Even with the standard DPO loss, which is a special case of our proposed wDPO, OPTune with $\rho=0.7$ maintains performance while saving 30% of the generation cost. Rdm_$\rho$: random selection at ratio $\rho$.
  • Figure 5: Win score vs. training time for different prompt selection ratios. By re-generating responses for only half of the prompts, OPTune achieves a win score on par with the vanilla online version ($\rho=1$).
  • ...and 2 more figures
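
The prompt selection step described in the Figure 1 caption (re-generate for low-reward prompts, reuse the rest) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function name `select_prompts_for_regeneration`, the dictionary of stored rewards, ranking by the reward of the existing response, and rounding the cutoff up with `math.ceil` are all hypothetical choices.

```python
import math


def select_prompts_for_regeneration(rewards: dict, rho: float):
    """Split prompts into (regenerate, reuse) lists by ascending reward.

    rewards: maps a prompt id to the reward of its existing response
             (assumed to come from the reward model R).
    rho:     fraction of prompts to re-generate, between 0 and 1.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    # Lowest-reward prompts first: these are the ones selected for exploration.
    ranked = sorted(rewards, key=rewards.get)
    # Rounding up is an assumption; the source does not say how ties or fractions are handled.
    k = math.ceil(rho * len(ranked))
    return ranked[:k], ranked[k:]


# Example: with rho = 0.5, the two lowest-reward prompts are re-generated.
example_rewards = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
regenerate, reuse = select_prompts_for_regeneration(example_rewards, rho=0.5)
```

Setting `rho=1` selects every prompt, which corresponds to the vanilla online setting that the figure captions use as the baseline.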