Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun

TL;DR

The paper tackles the challenge of multi-turn RLHF by addressing covariate shift that arises when converting multi-turn dialogues into last-turn problems. It introduces REFUEL, a regression-based, on-policy optimization method that uses pairwise, on-policy rollouts from shared prefixes to estimate relative future Q-values without a separate critic, and it proves performance guarantees under relaxed completeness assumptions. The authors provide empirical evidence that REFUEL outperforms strong baselines like DPO and REBEL in multi-turn dialogue simulations, with smaller models matching or exceeding larger ones on longer conversations. Overall, REFUEL offers a scalable, principled approach for improving long-horizon dialogue with RLHF, supported by both theory and experiments.

Abstract

Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI.
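
To make the "regression on relative future rewards" idea concrete, here is a minimal sketch of a per-sample loss. It assumes a REBEL-style squared-error objective: from a shared dialogue prefix, two candidate responses are sampled and rolled out to the end of the conversation with the current policy, and the scaled difference of their log-probability ratios is regressed onto the difference of their rollout rewards. The function name, argument names, and the exact form of the objective are illustrative assumptions rather than the authors' implementation; the official code is in the repository linked above.

```python
def relative_future_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                         reward_a, reward_b, eta=1.0):
    """Hypothetical per-sample loss for a REFUEL-like update (assumed form).

    logp_new_*: log-prob of each response under the policy being trained.
    logp_old_*: log-prob of each response under the policy that generated the data.
    reward_*:   total reward of the rollout that continues from each response.
    eta:        scaling hyperparameter (assumed to play the role of a step size).
    """
    # Log-probability ratios of the two responses from the same prefix.
    log_ratio_a = logp_new_a - logp_old_a
    log_ratio_b = logp_new_b - logp_old_b
    # Relative "implicit value" that the new policy assigns to a over b.
    predicted_gap = (log_ratio_a - log_ratio_b) / eta
    # Relative future reward observed from the two on-policy rollouts.
    observed_gap = reward_a - reward_b
    # Least-squares regression of the predicted gap onto the observed gap.
    return (predicted_gap - observed_gap) ** 2


# Unchanged policy (ratios are zero) but a reward gap of 0.5 between rollouts.
loss_before = relative_future_loss(-1.0, -1.0, -2.0, -2.0, 1.0, 0.5)
# Policy shifted toward response a so the scaled gap matches the reward gap.
loss_after = relative_future_loss(-0.5, -1.0, -2.5, -2.0, 1.0, 0.5, eta=2.0)
```

Because only the difference between two rollouts from the same prefix enters the loss, no separate critic network is needed in this sketch, which is consistent with the single-model claim in the abstract.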

Paper Structure

This paper contains 34 sections, 6 theorems, 48 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Under the regression and coverage assumptions (ass:regression, ass:coverage), if we initialize $\pi_1$ to be a uniformly random policy and choose an appropriate $\eta$, then after $T$ iterations there must exist a policy $\pi_t$ with $t\in[T]$ such that, for every comparator policy $\pi^*$,

Figures (2)

  • Figure 1: We present REFUEL: a simple, regression-based approach for multi-turn RLHF. Traditional single-turn RLHF methods suffer from covariate shift because they train on histories generated by the base policy rather than the current policy. REFUEL eliminates the covariate shift by iteratively generating on-policy datasets, aligning the training and testing distributions. REFUEL performs better at later turns than the baseline methods in terms of winrate (computed against the base policy, Llama-3-8B-it, using GPT4).
  • Figure 2: Plot of winrate vs. step. Rebel-lt-Mixed converges faster, while REFUEL achieves a higher winrate towards the end of training.

Theorems & Definitions (9)

  • Theorem 1
  • Definition 1
  • Remark 1
  • Proposition 1
  • Proposition 2
  • Claim 1
  • Lemma 1
  • Lemma 2
  • Lemma 3