Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning

Xuecheng Niu, Akinori Ito, Takashi Nose

TL;DR

This work tackles data efficiency in task-oriented dialog policy learning by introducing SC-DDQ, a curiosity-driven curriculum framework built on Deep Dyna-Q (DDQ). The approach integrates an offline task classifier, a world-model-based planner, and a curiosity policy that augments action selection with intrinsic motivation, and it is evaluated under multiple curriculum schedules with easy-first and difficult-first variants. Key findings show that SC-DDQ and its variants outperform the DDQ and DQN baselines, and that the effectiveness of a training strategy depends on whether curiosity is used: early-stage exploration (high entropy) followed by convergence (low entropy) yields better final performance. The results have practical implications for deploying data-efficient dialog agents and suggest potential adaptations to other RL-NLP tasks and cyber-defense scenarios.
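The easy-first and difficult-first variants mentioned above can be illustrated with a minimal sketch. It assumes each user goal already carries an integer difficulty label (for instance, one produced by the offline task classifier); the `UserGoal` class, the `build_schedule` function, and the 0/1/2 difficulty scale are hypothetical names chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserGoal:
    goal_id: str
    difficulty: int  # hypothetical label: 0 = easy, larger = harder


def build_schedule(goals, strategy="easy_first"):
    """Group goals into training stages ordered by difficulty.

    "easy_first" mimics classic curriculum learning; "difficult_first"
    is its reverse. Both names are assumptions made for this sketch.
    """
    if strategy not in ("easy_first", "difficult_first"):
        raise ValueError(f"unknown strategy: {strategy}")
    levels = sorted({g.difficulty for g in goals},
                    reverse=(strategy == "difficult_first"))
    # One stage per difficulty level, preserving the input order within a stage.
    return [[g for g in goals if g.difficulty == level] for level in levels]


goals = [UserGoal("a", 2), UserGoal("b", 0), UserGoal("c", 1), UserGoal("d", 0)]
easy_first = build_schedule(goals, "easy_first")
difficult_first = build_schedule(goals, "difficult_first")
```

Here `easy_first` holds the stages in increasing difficulty and `difficult_first` holds the same stages in the opposite order; an agent would then be trained on one stage at a time.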

Abstract

Training task-oriented dialog agents with reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning a dialog policy from limited dialog experience remains an obstacle that makes agent training less efficient. In addition, most previous frameworks start training by choosing training samples at random, which differs from how humans learn and hurts the efficiency and stability of training. We therefore propose Scheduled Curiosity-Deep Dyna-Q (SC-DDQ), a curiosity-driven curriculum learning framework based on a state-of-the-art model-based reinforcement learning dialog model, Deep Dyna-Q (DDQ). Furthermore, we designed learning schedules for both SC-DDQ and DDQ, following two opposite training strategies: classic curriculum learning and its reverse version. Our results show that introducing scheduled learning and curiosity leads to a significant improvement over DDQ and Deep Q-learning (DQN). Surprisingly, we found that traditional curriculum learning was not always effective. Specifically, according to the experimental results, the easy-first and difficult-first strategies are more suitable for SC-DDQ and DDQ, respectively. To analyze our results, we used the entropy of sampled actions to characterize action exploration and found that training strategies with high entropy in the first stage and low entropy in the last stage lead to better performance.
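As an illustration of the entropy analysis described above, the following minimal sketch computes the Shannon entropy of the empirical distribution of a batch of sampled actions. The function name `action_entropy`, the base-2 logarithm, and the example action labels are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in bits) of the empirical action distribution."""
    total = len(actions)
    if total == 0:
        return 0.0
    counts = Counter(actions)
    # H = sum_a p(a) * log2(1 / p(a)), with p(a) = count(a) / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical action labels: an exploratory batch versus a converged one.
early_stage = ["request", "inform", "confirm", "deny"]
late_stage = ["inform", "inform", "inform", "inform"]

high = action_entropy(early_stage)  # four equally likely actions
low = action_entropy(late_stage)    # a single repeated action
```

A batch that spreads evenly over many actions gives a high value (wide exploration), while a batch dominated by one action gives a value near zero (a converged policy).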

Paper Structure

This paper contains 24 sections, 2 equations, 11 figures, 6 tables, and 1 algorithm.

Figures (11)

  • Figure 1: The framework of DDQ
  • Figure 2: The structure of the world model
  • Figure 3: The framework of Scheduled Curiosity-Deep Dyna-Q
  • Figure 4: A successful task completed by a rule-based agent
  • Figure 5: User goals at different difficulty levels. UNK means that the corresponding slot is unknown.
  • ...and 6 more figures