Universal Post-Processing Networks for Joint Optimization of Modules in Task-Oriented Dialogue Systems

Atsumoto Ohashi, Ryuichiro Higashinaka

TL;DR

This work aims to improve task-oriented dialogue systems by jointly optimizing the outputs of all modules with a universal post-processing network (UniPPN). UniPPN treats post-processing as a sequence-transformation task performed by a module-aware language model and is trained under a module-level MDP: imitation learning bootstraps the model, and reinforcement learning with PPO then provides fine-grained credit assignment. In MultiWOZ-based simulations and human studies, UniPPN outperforms the earlier disjoint post-processing approaches (BinPPN and GenPPN), achieving higher task success and often fewer dialogue turns while maintaining user satisfaction. The approach applies to both pipeline and end-to-end systems, including those with non-trainable or API-based modules, and is expected to benefit real-world deployments by reducing training complexity and improving coordination between modules.

Abstract

Post-processing networks (PPNs) are components that modify the outputs of arbitrary modules in task-oriented dialogue systems and are optimized using reinforcement learning (RL) to improve the overall task completion capability of the system. However, previous PPN-based approaches have been limited to handling only a subset of modules within a system, which poses a significant limitation in improving the system performance. In this study, we propose a joint optimization method for post-processing the outputs of all modules using universal post-processing networks (UniPPNs), which are language-model-based networks that can modify the outputs of arbitrary modules in a system as a sequence-transformation task. Moreover, our RL algorithm, which employs a module-level Markov decision process, enables fine-grained value and advantage estimation for each module, thereby stabilizing joint learning for post-processing the outputs of all modules. Through both simulation-based and human evaluation experiments using the MultiWOZ dataset, we demonstrated that UniPPN outperforms conventional PPNs in the task completion capability of task-oriented dialogue systems.
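To make the idea of per-module value and advantage estimation concrete, the following is a minimal sketch, not the authors' implementation. It assumes generalized advantage estimation (GAE), a standard estimator often paired with PPO, and treats each module's post-processing action as one step of the MDP, so that every module receives its own advantage. The function name, the example rewards and values, and the hyperparameters `gamma` and `lam` are all hypothetical.

```python
from typing import List


def module_level_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Compute GAE advantages over a sequence of module-level steps.

    Assumption (not from the paper): rewards[i] and values[i] belong to the
    i-th module-level step of one episode, and the episode terminates after
    the last step, so the value after the final step is 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = 0.0      # value of the state after the current step
    next_advantage = 0.0  # running discounted sum of TD errors
    for i in reversed(range(len(rewards))):
        # One-step temporal-difference error for this module-level step.
        delta = rewards[i] + gamma * next_value - values[i]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[i] = next_advantage
        next_value = values[i]
    return advantages


# Hypothetical turn with four modules; the reward arrives only at the end.
example_rewards = [0.0, 0.0, 0.0, 1.0]
example_values = [0.5, 0.5, 0.5, 0.5]
example_advantages = module_level_gae(
    example_rewards, example_values, gamma=1.0, lam=1.0
)
```

With one step per module rather than one step per turn, a single end-of-turn reward is propagated back through every module's step, which is one way to read the "fine-grained" estimation the abstract describes.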

Paper Structure

This paper contains 30 sections, 10 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Diagram of UniPPN. UniPPN modifies the output $\text{out}_m$ of $\text{Module}_m$ to $\text{out}_m^+$, which serves as the input for $\text{Module}_{m+1}$.
  • Figure 2: Procedure for creating pseudo-post-processing demonstration data. First, we generate dialogues between the dialogue system and the user simulator. Subsequently, we create pairs of positive and negative outputs, where the output $\text{out}_t$ of module $m$ for context $s_t$ at turn $t$ is positive and the output $\text{out}_u$ at another turn $u$ is negative (i.e., $\text{out}_t^-$). In the imitation learning stage, the reconstruction from $\text{out}_t^-$ to $\text{out}_t$ is learned as pseudo-post-processing.
  • Figure 3: Task completion metrics
  • Figure 4: Advantage estimates
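
The pairing procedure described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name, the tuple layout, and the choice to skip candidates identical to the positive output are assumptions.

```python
import random
from typing import List, Tuple

# (context s_t, output out_t) recorded for one module at each dialogue turn.
Turn = Tuple[str, str]
# (context s_t, negative output out_t^-, positive output out_t)
Example = Tuple[str, str, str]


def make_pseudo_pairs(turns: List[Turn], rng: random.Random) -> List[Example]:
    """Build pseudo-post-processing examples for a single module.

    For each turn t, the module's own output is the positive target and an
    output sampled from another turn u != t is the negative input.
    """
    examples: List[Example] = []
    for t, (context, positive) in enumerate(turns):
        # Assumption: candidates equal to the positive output are skipped,
        # since they would give the model nothing to reconstruct.
        candidates = [
            out for u, (_, out) in enumerate(turns)
            if u != t and out != positive
        ]
        if not candidates:
            continue  # no usable negative for this turn
        negative = rng.choice(candidates)
        examples.append((context, negative, positive))
    return examples


# Hypothetical outputs of one module over a three-turn dialogue.
demo_turns = [
    ("user: I need a hotel", "inform(hotel)"),
    ("user: in the north", "inform(area=north)"),
    ("user: book it", "request(booking)"),
]
demo_examples = make_pseudo_pairs(demo_turns, random.Random(0))
```

Each resulting triple gives the imitation learning stage a context, a corrupted input, and the target to reconstruct, so the model learns to map $\text{out}_t^-$ back to $\text{out}_t$.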