Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
Xuecheng Niu, Akinori Ito, Takashi Nose
TL;DR
This work tackles data efficiency in task-oriented dialog policy learning by introducing SC-DDQ, a curiosity-driven curriculum learning framework built on Deep Dyna-Q. The approach integrates an offline task classifier, a world-model-based planner, and a curiosity policy that augments action selection with intrinsic motivation, and it is evaluated under multiple curriculum schedules with easy-first and difficult-first variants. Key findings show that SC-DDQ and its variants outperform the DDQ and DQN baselines, and that the more effective training strategy depends on whether curiosity is used; schedules with early-stage exploration (high action entropy) followed by convergence (low action entropy) yield better final performance. The results have practical implications for deploying data-efficient dialog agents and suggest potential adaptations to other RL-NLP tasks and cyber-defense scenarios.
Abstract
Training task-oriented dialog agents with reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning a dialog policy from limited dialog experience remains an obstacle that makes agent training less efficient. In addition, most previous frameworks start training by randomly choosing training samples, which differs from how humans learn and hurts the efficiency and stability of training. Therefore, we propose Scheduled Curiosity-Deep Dyna-Q (SC-DDQ), a curiosity-driven curriculum learning framework based on a state-of-the-art model-based reinforcement learning dialog model, Deep Dyna-Q (DDQ). Furthermore, we designed learning schedules for SC-DDQ and DDQ, respectively, following two opposite training strategies: classic curriculum learning (easy-first) and its reverse version (difficult-first). Our results show that by introducing scheduled learning and curiosity, the new framework leads to a significant improvement over DDQ and Deep Q-learning (DQN). Surprisingly, we found that traditional curriculum learning was not always effective. Specifically, according to the experimental results, the easy-first and difficult-first strategies are more suitable for SC-DDQ and DDQ, respectively. To analyze our results, we adopted the entropy of sampled actions to depict action exploration and found that training strategies with high entropy in the first stage and low entropy in the last stage lead to better performance.
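As an illustration of the entropy analysis mentioned above, the following minimal sketch computes the Shannon entropy of a batch of sampled actions. It assumes actions are discrete identifiers and that entropy is taken over their empirical frequencies; the function name `action_entropy` and the choice of base-2 logarithm are assumptions made for this example, not details given in the abstract.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in bits) of the empirical distribution of sampled actions.

    `actions` is any non-empty sequence of hashable action identifiers
    (hypothetical representation, chosen for illustration).
    """
    if not actions:
        raise ValueError("actions must be non-empty")
    counts = Counter(actions)
    total = len(actions)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        # Each observed action contributes -p * log2(p) to the entropy.
        entropy -= p * math.log2(p)
    return entropy


# A stage that always samples the same action has zero entropy (no exploration),
# while a stage that spreads evenly over four actions has the maximum, log2(4).
exploit_stage = ["inform", "inform", "inform", "inform"]
explore_stage = ["inform", "request", "confirm", "bye"]
low = action_entropy(exploit_stage)
high = action_entropy(explore_stage)
```

Under this reading, a schedule that performs well would show values closer to `high` in its first training stage and closer to `low` in its last stage.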
