Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO

Daniel R. Jiang; Jalaj Bhandari; Yukai Yang; Rémi Munos; Tyler Lu

TL;DR

The paper tackles the challenge of optimizing LLMs for multi-turn conversational outcomes, where rewards are sparse and long-horizon. It introduces a formal reduction that treats the multi-turn problem as a sequence of single-turn RLHF-style problems, using a learned multi-turn Q-function $Q^\pi$ as the single-turn reward, and proves that solving the single-turn problem with token-level PPO corresponds to a policy improvement step in the multi-turn setting. This insight motivates Iterative PPO, a batch online policy iteration algorithm that alternates between fitting $Q^\pi$ from logged trajectories and applying standard PPO to improve the policy. The approach reuses mature single-turn RLHF tools for stability and practicality, and occupies a middle ground between fully online and fully offline methods. Conceptually simple and trained on logged conversations, Iterative PPO enables continual learning from real interactions without requiring a simulator, with applicability to goal-directed dialogue in e-commerce, customer service, and related domains.

Abstract

Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.
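The alternation described above (fit a Q-function from logged conversations, then improve the policy on a single-turn problem whose reward is that Q-function) can be illustrated with a small tabular toy. This is a minimal sketch under loud assumptions, not the paper's implementation: the environment, the constants `T`, `LEVELS`, `ACTIONS`, `P_UP`, and every function name are hypothetical, Q is fit by plain Monte Carlo averaging, and the policy step uses the closed-form KL-regularized update (new policy proportional to old policy times `exp(Q / beta)`) as a stand-in for token-level PPO.

```python
import numpy as np

# Hypothetical toy "conversation": T turns, 3 customer-interest levels, 2 responses.
T, LEVELS, ACTIONS = 4, 3, 2
P_UP = np.array([0.3, 0.7])  # assumed chance that each response raises interest

def step(level, action, rng):
    """One toy turn: interest moves up or down by one level (clipped at the ends)."""
    up = rng.random() < P_UP[action]
    return min(level + 1, LEVELS - 1) if up else max(level - 1, 0)

def collect(policy, n, rng):
    """Log n conversations under the current policy; reward arrives only at the end."""
    logs = []
    for _ in range(n):
        level, visited = 1, []
        for t in range(T):
            a = rng.choice(ACTIONS, p=policy[t, level])
            visited.append((t, level, a))
            level = step(level, a, rng)
        outcome = float(level == LEVELS - 1)  # sparse conversational outcome
        logs.append((visited, outcome))
    return logs

def fit_q(logs):
    """Monte Carlo estimate of Q^pi: mean final outcome per (turn, state, response)."""
    total = np.zeros((T, LEVELS, ACTIONS))
    count = np.zeros((T, LEVELS, ACTIONS))
    for visited, outcome in logs:
        for t, s, a in visited:
            total[t, s, a] += outcome
            count[t, s, a] += 1
    return total / np.maximum(count, 1)  # unvisited entries stay at 0

def improve(policy, q, beta=0.2):
    """Single-turn KL-regularized step with Q as reward (stand-in for PPO)."""
    new = policy * np.exp(q / beta)
    return new / new.sum(axis=-1, keepdims=True)

def true_value(policy):
    """Exact expected outcome of a policy, by backward induction on the toy model."""
    v = (np.arange(LEVELS) == LEVELS - 1).astype(float)
    up = np.minimum(np.arange(LEVELS) + 1, LEVELS - 1)
    down = np.maximum(np.arange(LEVELS) - 1, 0)
    for t in reversed(range(T)):
        q = P_UP[None, :] * v[up][:, None] + (1 - P_UP[None, :]) * v[down][:, None]
        v = (policy[t] * q).sum(axis=-1)
    return v[1]  # conversations start at the middle interest level

rng = np.random.default_rng(0)
policy = np.full((T, LEVELS, ACTIONS), 1.0 / ACTIONS)  # start from a uniform policy
values = [true_value(policy)]
for _ in range(5):
    logs = collect(policy, 2000, rng)      # batch of logged trajectories
    policy = improve(policy, fit_q(logs))  # single-turn improvement step
    values.append(true_value(policy))
```

Each pass through the loop uses only conversations logged under the current policy, which is what places the scheme between fully online and fully offline training. In this toy, `values` tracks the exact expected outcome after each round, so improvement can be checked directly rather than estimated from samples.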
