Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues

Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Yiheng Sun, Zerui Chen, Ming Liu, Bing Qin

TL;DR

Proactive dialogue tasks require forward-looking policy planning beyond passive instruction-following. The authors introduce LDPP, a simulation-free framework that learns fine-grained latent policies $z_t$ from real dialogue records using a codebook $\mathcal{Z}$ of $K$ vectors, annotates the data with these latent labels, and trains an offline hierarchical RL system in the latent space. A P-Former bridges latent policies to the LLM, enabling policy-guided generation without online simulation. Training proceeds in three stages (latent policy discovery, policy distillation, and offline RL enhancement) and yields a system that outperforms strong baselines across ExTES, ESConv, and P4G, even surpassing ChatGPT with only a 1.8B-parameter LLM. The approach reduces reliance on costly simulation and predefined policy sets while keeping the learned latent policies interpretable.
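
To make the codebook idea concrete, here is a minimal sketch of how an utterance encoding could be mapped to a latent policy label. It assumes a nearest-neighbour lookup in the style of vector-quantised autoencoders; the summary above does not specify the authors' exact VAE variant, so the lookup rule, the function names (`assign_latent_policy`, `annotate_dialogue`), and the toy numbers are all illustrative assumptions, not the paper's implementation.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_latent_policy(encoding: Sequence[float],
                         codebook: Sequence[Sequence[float]]) -> int:
    """Return the index k of the codebook vector closest to `encoding`.

    ASSUMPTION: nearest-neighbour quantisation; the paper's actual
    assignment rule may differ.
    """
    return min(range(len(codebook)),
               key=lambda k: squared_distance(encoding, codebook[k]))


def annotate_dialogue(utterance_encodings: Sequence[Sequence[float]],
                      codebook: Sequence[Sequence[float]]) -> List[int]:
    """Label every system utterance in a dialogue with a latent policy index."""
    return [assign_latent_policy(e, codebook) for e in utterance_encodings]


# Toy codebook with K = 3 two-dimensional vectors (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Hypothetical encodings of three system utterances.
encodings = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.9]]

labels = annotate_dialogue(encodings, codebook)
print(labels)  # [0, 1, 2]
```

Under these assumptions, the resulting index sequence plays the role of the automatic latent policy annotation on which the offline RL stage is then trained.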

Abstract

Recent advancements in proactive dialogues have garnered significant attention, particularly for more complex objectives (e.g. emotion support and persuasion). Unlike traditional task-oriented dialogues, proactive dialogues demand advanced policy planning and adaptability, requiring rich scenarios and comprehensive policy repositories to develop such systems. However, existing approaches tend to rely on Large Language Models (LLMs) for user simulation and online learning, leading to biases that diverge from realistic scenarios and result in suboptimal efficiency. Moreover, these methods depend on manually defined, context-independent, coarse-grained policies, which not only incur high expert costs but also raise concerns regarding their completeness. In our work, we highlight the potential for automatically discovering policies directly from raw, real-world dialogue records. To this end, we introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Specifically, we employ a variant of the Variational Autoencoder to discover fine-grained policies represented as latent vectors. After automatically annotating the data with these latent policy labels, we propose an Offline Hierarchical Reinforcement Learning (RL) algorithm in the latent space to develop effective policy planning capabilities. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM.

Paper Structure

This paper contains 33 sections, 13 equations, 3 figures, 27 tables.

Figures (3)

  • Figure 1: The training process of the LDPP framework. $u$ and $z$ denote a system utterance and the latent policy it contains. $h$ and $h'$ denote the $t$-th dialogue history $h_t$ and the $(t+1)$-th dialogue history $h_{t+1}$, respectively.
  • Figure 2: Performance comparison as the LLM size and LLM series change on ExTES.
  • Figure 3: Visualization of latent policies for utterances belonging to the top-4 and top-6 most frequently used policies.