Anti-Overestimation Dialogue Policy Learning for Task-Completion Dialogue System

Chang Tian, Wenpeng Yin, Marie-Francine Moens

TL;DR

This work tackles the overestimation problem in RL-based task-oriented dialogue policy by introducing Dynamic Partial Average (DPAV), an estimator that blends the predicted maximum and minimum action values with a dynamic weight to reduce bias. Integrated into a DPAV DQN, it comes with a convergence proof and bias/variance bounds that position it favorably against ME, DE, and ensemble methods. Empirically, DPAV DQN achieves better or comparable performance across movie, restaurant, and taxi datasets while reducing computational load, demonstrating practical utility for robust dialogue policy learning. The approach offers a principled, efficient alternative to ensembles with strong theoretical guarantees and improved action-value estimation.

Abstract

A dialogue policy module is an essential part of task-completion dialogue systems. Interest in reinforcement learning (RL)-based dialogue policies has grown recently. Their performance and action decisions rely on accurate estimation of action values. Overestimation is a widely known problem in RL: the estimate of the maximum action value is larger than the ground truth, which results in an unstable learning process and a suboptimal policy. This problem is detrimental to RL-based dialogue policy learning. To mitigate it, this paper proposes a dynamic partial average estimator (DPAV) of the ground-truth maximum action value. DPAV calculates the partial average between the predicted maximum action value and the predicted minimum action value, where the weights are dynamically adaptive and problem-dependent. We incorporate DPAV into a deep Q-network as the dialogue policy and show that our method achieves better or comparable results relative to top baselines on three dialogue datasets from different domains, with a lower computational load. We also prove convergence theoretically and derive upper and lower bounds on the bias, which we compare with those of other methods.
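
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the "partial average" is a convex combination `lam * max + (1 - lam) * min` with a single scalar weight `lam` in [0, 1], and it keeps `lam` fixed, whereas the paper adapts the weight dynamically. The function names (`partial_average`, `td_target`, `mean_bias`) and the Gaussian noise model are hypothetical and chosen only for illustration.

```python
import random


def max_estimator(q_values):
    """Standard Q-learning estimator: take the largest predicted action value."""
    return max(q_values)


def partial_average(q_values, lam):
    """Blend the predicted max and min action values (assumed convex form)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * max(q_values) + (1.0 - lam) * min(q_values)


def td_target(reward, next_q_values, lam, gamma=0.9, done=False):
    """One-step bootstrapped target using the blended estimate."""
    if done:
        return reward
    return reward + gamma * partial_average(next_q_values, lam)


def mean_bias(estimator, true_values, noise_std=1.0, trials=20000, seed=0):
    """Average (estimate - true max) when predictions carry zero-mean noise."""
    rng = random.Random(seed)
    true_max = max(true_values)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, noise_std) for v in true_values]
        total += estimator(noisy) - true_max
    return total / trials


if __name__ == "__main__":
    true_q = [0.0, 0.0, 0.0, 0.0]  # every action is equally good
    print("max estimator bias:", mean_bias(max_estimator, true_q))
    print("blend (lam=0.5) bias:",
          mean_bias(lambda q: partial_average(q, 0.5), true_q))
```

With noisy predictions, taking the maximum is biased upward even when every true action value is equal, which is the overestimation the abstract describes. Mixing in the minimum pulls the estimate back down. How far to pull depends on the weight, which is why the paper treats it as problem-dependent.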

Paper Structure

This paper contains 37 sections, 13 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Task-completion dialogue system
  • Figure 2: Wrong and correct dialogue actions
  • Figure 3: The top row shows the learning curves of the dialogue policies. The X-axis is the number of training epochs and the Y-axis is the success rate of the dialogue policies on the test dataset. The second row shows the average reward per dialogue on the test dataset. The third row shows the influence of different initial $\lambda$ values and value search schemes, with the same X-axis and Y-axis as the top row. Each learning curve is averaged over 3 runs on the test dataset.
  • Figure 4: Learning curves of the averaged maximal action value of the dialogue starting state, measured when the dialogue policies are evaluated on the movie test set during training. The Y-axis shows the averaged maximal action value of the starting state.