Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

Haoyu Wang, Yuxin Chen, Liang Luo, Buyun Zhang, Ellie Dingqiao Wen, Pan Li

Abstract

Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals are more robust and can use a normalization mechanism to further improve training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves better convergence than existing baselines. Detailed trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment. Code is publicly available at https://github.com/Graph-COM/ITPO.
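
The step from token-level to turn-wise rewards can be illustrated with a minimal sketch. The abstract does not specify how tokens are aggregated or which normalization is used, so everything here is an assumption for illustration: the function names `turn_rewards` and `normalize` are hypothetical, aggregation is a plain sum over the tokens of each turn, and normalization is a z-score across turns. The released code may differ.

```python
from statistics import mean, pstdev


def turn_rewards(token_rewards, turn_ids):
    """Sum token-level rewards within each turn.

    turn_ids[i] is the 0-based turn index of token i.
    Summation is an assumed aggregation rule, not taken from the paper.
    """
    totals = [0.0] * (max(turn_ids) + 1)
    for reward, turn in zip(token_rewards, turn_ids):
        totals[turn] += reward
    return totals


def normalize(values, eps=1e-8):
    """Z-score the turn-wise rewards (an assumed normalization choice)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


# Toy token-level rewards for a three-turn conversation (made-up numbers).
token_rewards = [0.2, -0.1, 0.4, 0.05, -0.3, 0.1]
turn_ids = [0, 0, 0, 1, 1, 2]

per_turn = turn_rewards(token_rewards, turn_ids)
normalized = normalize(per_turn)
print(per_turn)
print(normalized)
```

In this toy input the first turn receives the largest aggregated reward, and the normalized values are centered at zero, so they can be read as relative preferences among turns rather than absolute scores.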

Paper Structure (31 sections, 8 equations, 23 figures, 2 tables, 1 algorithm)

Figures (23)

  • Figure 1: The overall framework of ITPO for User-LLM interactions with a real case example in Medical Recommendation. The optimization loop consists of: (1) multi-turn online roll-out, (2) implicit PRM update with outcome rewards and token-level process reward estimation, (3) turn-wise process reward aggregation and normalization, (4) advantage estimation and policy optimization.
  • Figure 2: Dynamics of the relative implicit rewards on a real Math Tutoring case. The turn-wise rewards that ITPO assigns to Response Turn #1 consistently rank highest from step $20$ onward, reflecting the importance of resolving ambiguity early in the conversation. By contrast, the token-level rewards show high variance, with the signals for identical tokens fluctuating across training steps (red boxes).
  • Figure 3: Spearman correlation of the token-level and turn-wise implicit rewards against their respective converged baselines (the averaged rewards over the final $70$ optimization steps) on Math Tutoring. Turn-wise preference stabilizes quickly, while token-level rankings show slower convergence due to optimization difficulty.
  • Figure 4: The training curve of reward attribution methods with the RLOO advantage estimator.
  • Figure 5: Kendall-$\tau$ between PRM and outcome reward w/ RLOO.
  • ...and 18 more figures
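
The rank-stability measure named in the Figure 3 caption, Spearman correlation between the rewards at a given step and a converged reference, can be sketched as follows. This is a generic textbook implementation (Pearson correlation of average ranks), not the authors' evaluation script; the function names and the toy reward values are assumptions for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Toy turn-wise rewards at one step versus a converged reference (made up).
step_rewards = [0.5, -0.25, 0.1]
converged = [0.6, -0.1, 0.2]
rho_same = spearman(step_rewards, converged)

# One adjacent swap in a four-item ranking.
rho_swap = spearman([1, 2, 3, 4], [1, 3, 2, 4])
print(rho_same, rho_swap)
```

A value near 1 means the step's ranking of turns (or tokens) already matches the converged ranking; lower values indicate that the ordering is still changing.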