Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO

Kun Peng; Conghui Tan; Yu Liu; Guohua Tang; Zhongqian Sun; Wei Yang; Zining Zhu; Lei Jiang; Yanbing Liu; Hao Peng

TL;DR

This work tackles online personalization and long-horizon optimization in open-ended dialogue through a two-agent game between a user agent and a dialogue agent. The user agent uses style mimicry and an active termination mechanism to build dynamic interaction environments without relying on pre-collected user data, while the dialogue agent is trained with Adaptive Tree-based GRPO (AT-GRPO) to capture long-term dialogue value through adaptive, tree-structured rollouts. AT-GRPO reduces the rollout budget from exponential to polynomial in the dialogue length by combining adaptive observation ranges with a single-trajectory expansion strategy; its objective blends immediate and long-term rewards through a normalized advantage and KL regularization. Experiments on NPC-Chat, LCCC, and DailyDialog show improved performance, sample efficiency, and robustness, including generalization to out-of-domain data, and ablations support the contributions of style mimicry, termination dynamics, adaptive observation, and tree-based reward aggregation.
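To make the "normalized advantage" in the summary above concrete, here is a minimal sketch of the group-relative normalization that GRPO-style methods use: each sampled response is scored against the mean and standard deviation of its own sampling group. The `blended_reward` helper, its discount factor `gamma`, and the way future rewards are combined are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev


def blended_reward(immediate, future_rewards, gamma=0.9):
    """Hypothetical blend: immediate reward plus discounted future rewards.

    `future_rewards` stands in for the rewards observed inside a node's
    observation range; `gamma` is an assumed discount factor.
    """
    total = immediate
    for step, reward in enumerate(future_rewards, start=1):
        total += (gamma ** step) * reward
    return total


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidate responses sampled for the same dialogue context.
group = [
    blended_reward(0.2, [0.1, 0.1]),
    blended_reward(0.5, [0.6, 0.7]),
    blended_reward(0.9, [0.2]),
]
advantages = group_relative_advantages(group)
```

Because the advantages are centered on the group mean, a response is reinforced only when it beats its siblings, which is why no separate value network is needed in this family of methods.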

Abstract

Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
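The exponential-versus-polynomial claim can be illustrated with a rough counting sketch. The cost model below is an assumption made for illustration, not the paper's accounting: full expansion branches every node at every turn, while the adaptive variant keeps a single main trajectory, samples `branching` candidates per turn, and follows each candidate for a stage-dependent observation range. The linear schedule in `observation_range` is likewise hypothetical.

```python
def full_tree_budget(branching, turns):
    """Responses generated if every node is expanded at every turn."""
    return sum(branching ** depth for depth in range(1, turns + 1))


def observation_range(turn, turns, max_range):
    """Hypothetical schedule: wide range early, narrow range late.

    `turn` is 1-based. Integer ceiling division avoids float rounding.
    """
    remaining = turns - turn + 1
    scaled = -(-max_range * remaining // turns)  # ceil(max_range * remaining / turns)
    return max(1, scaled)


def adaptive_budget(branching, turns, max_range):
    """Responses generated along one main trajectory with bounded lookahead.

    Assumed cost per turn: `branching` candidates, each followed by a
    linear lookahead of `observation_range(...)` further responses.
    """
    total = 0
    for turn in range(1, turns + 1):
        lookahead = observation_range(turn, turns, max_range)
        total += branching * (1 + lookahead)
    return total


full = full_tree_budget(branching=4, turns=10)
adaptive = adaptive_budget(branching=4, turns=10, max_range=4)
print(f"full tree: {full}, adaptive: {adaptive}")
```

Under these assumptions the adaptive budget is bounded by `turns * branching * (1 + max_range)`, which grows linearly in the dialogue length, whereas the full-tree count grows as `branching ** turns`.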

Paper Structure (37 sections, 15 equations, 7 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Personalized Dialogue LLM
Tree-Search Enhanced RL for LLM
Proposed Methodology
User Agent
Style Mimicry and Termination Preference Learning
Explicit Probability Propagation and Dynamic Threshold Adjustment
Dialogue Agent and Adaptive Tree-based GRPO (AT-GRPO)
Adaptive Observation Range
Single-trajectory Tree Expansion
Computational Complexity
Objective Function
Experiments
Settings
...and 22 more sections

Figures (7)

Figure 1: Limitations of existing methods vs. our framework.
Figure 2: The overall framework operates through an iterative closed loop: the user agent provides real-time feedback (e.g., termination probability as reward signals) based on the dialogue agent’s responses, and the dialogue agent is trained via AT-GRPO to balance immediate interaction quality and long-term conversational value.
Figure 3: User agent evaluation. (Flu., Con., Div. and Nat. denote Fluency, Consistency, Diversity, and Naturalness, respectively.)
Figure 4: Hyper-parameter study. All experiments are conducted using Qwen2.5-14B as the base model on the NPC-Chat dataset.
Figure 5: Training curves: $Avg.r$ (left) and $Avg.L$ (right).
...and 2 more figures
