MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan, Zhaohan Chen, Xiaofan Zhang

TL;DR

This work proposes MAPO, a critic-free and efficient RL algorithm that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. It demonstrates that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.

Abstract

Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
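
To make "propagates long-horizon effects through Monte Carlo returns" concrete, the following is a minimal sketch of the textbook reward-to-go computation over per-turn judge rewards. The function name `monte_carlo_returns`, the discount factor `gamma`, and the example scores are illustrative assumptions; the abstract does not state the exact return definition or whether discounting is used.

```python
import numpy as np


def monte_carlo_returns(rewards, gamma=1.0):
    """Compute G_t = sum_{j >= t} gamma**(j - t) * r_j for every turn t.

    `rewards` holds one judge reward per dialogue turn (hypothetical input);
    `gamma` is an assumed discount factor, with 1.0 meaning no discounting.
    """
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    # Walk backwards so each turn accumulates the rewards of all later turns.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical per-turn judge scores for a three-turn dialogue.
judge_rewards = [1.0, 0.0, 2.0]
print(monte_carlo_returns(judge_rewards))
```

Under this definition, the return at an early turn credits that turn with the judge feedback received on all subsequent turns, which is how a dense per-turn signal can still carry long-horizon information.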

Paper Structure (31 sections, 14 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Framework of MAPO. The policy model interacts with EMPA \cite{zhang2026empaevaluatingpersonaalignedempathy} to collect multi-turn trajectories, and the policy is then optimized via the mixed-advantage estimator. Top: EMPA serves as a simulated multi-turn interaction environment (detailed in Sec. \ref{sec:env}). Bottom: Our policy optimization pipeline; the mixed-advantage estimator is described in detail in Sec. \ref{sec:mix_adv}.
  • Figure 2: Overall algorithm. Given an initial prompt, we sample $k$ trajectories from the current policy, each consisting of $m$ samples. The turn-level advantage is computed by normalizing returns across samples at the same turn. The batch-level advantage is computed by normalizing rewards over all $k \times m$ samples in the batch. The final advantage is a convex combination of these two terms, balancing fine-grained credit assignment with global batch-level optimization.
  • Figure 3: Distribution of Monte Carlo returns and immediate rewards across dialogue turns at a specific training step. (a) Monte Carlo returns exhibit a clear positive correlation with the turn index; (b) in contrast, immediate rewards show no discernible trend across turns.
  • Figure 4: Success rates (%) of Base, GRPO, and MAPO evaluated on samples dominated by different emotional needs. MAPO consistently achieves the highest success rates across all dimensions and scales.
  • Figure 5: Empathy alignment scores across various dimensions. MAPO consistently outperforms GRPO across all dimensions and model scales, and achieves larger alignment gains over Base, particularly for smaller models where Base exhibits negative alignment.
  • ...and 1 more figure
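
The mixed advantage described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mixed_advantage`, the mixing weight `alpha`, the stabilizer `eps`, and the use of a single `(k, m)` array for both normalizations are all hypothetical choices. The caption speaks of returns for the turn-level term and rewards for the batch-level term, so the paper may feed different quantities into each.

```python
import numpy as np


def mixed_advantage(values, alpha=0.5, eps=1e-8):
    """Convex combination of turn-level and batch-level normalized advantages.

    `values` has shape (k, m): k sampled trajectories by m turns (assumed layout).
    `alpha` in [0, 1] weights the turn-level term (assumed parameterization).
    """
    values = np.asarray(values, dtype=float)

    # Turn-level: normalize each turn (column) across the k trajectories.
    turn_mean = values.mean(axis=0, keepdims=True)
    turn_std = values.std(axis=0, keepdims=True)
    a_turn = (values - turn_mean) / (turn_std + eps)

    # Batch-level: normalize over all k * m samples at once.
    a_batch = (values - values.mean()) / (values.std() + eps)

    # Final advantage: convex combination of the two terms.
    return alpha * a_turn + (1.0 - alpha) * a_batch


# Hypothetical values for k = 2 trajectories and m = 2 turns.
example = [[1.0, 2.0], [3.0, 6.0]]
print(mixed_advantage(example, alpha=0.5))
```

Setting `alpha=1.0` recovers pure turn-level normalization and `alpha=0.0` pure batch-level normalization, which correspond to the two single-level schemes that the caption combines.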