End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient

Li Zhou, Kevin Small, Oleg Rokhlenko, Charles Elkan

TL;DR

The paper tackles offline learning of goal-oriented dialog policies from unannotated corpora, avoiding online interaction and predefined action spaces. It models agent utterance generation as a token-level MDP and uses an encoder-decoder policy trained via a combination of on-policy and off-policy policy gradient with a novel reward function that combines utterance-level similarity (BLEU) and dialog-level API-call timing/correctness. Reward shaping is applied to address sparse rewards, and the approach leverages high-quality transcripts (TACT) to accelerate convergence. On the bAbI task 6 restaurant domain, the method outperforms strong seq2seq-based baselines on both utterance and API-call metrics, demonstrating practical potential for end-to-end offline deployment.

Abstract

Learning a goal-oriented dialog policy is generally performed offline with supervised learning algorithms or online with reinforcement learning (RL). Additionally, as companies accumulate massive quantities of dialog transcripts between customers and trained human agents, encoder-decoder methods have gained popularity as agent utterances can be directly treated as supervision without the need for utterance-level annotations. However, one potential drawback of such approaches is that they myopically generate the next agent utterance without regard for dialog-level considerations. To resolve this concern, this paper describes an offline RL method for learning from unannotated corpora that can optimize a goal-oriented policy at both the utterance and dialog level. We introduce a novel reward function and use both on-policy and off-policy policy gradient to learn a policy offline without requiring online user interaction or an explicit state space definition.
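To make the reward idea concrete, the following is a minimal sketch of a reward that adds an utterance-level similarity term to a dialog-level API-call term. It is not the paper's reward function: clipped unigram precision stands in for BLEU, the +1/-1 API-call scoring ignores timing across turns, and the names `unigram_precision`, `api_call_reward`, `combined_reward`, the `api_weight` parameter, and the `api_call` token prefix are all assumptions made for illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision: a crude stand-in for BLEU (assumption)."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(candidate)


def api_call_reward(generated, reference, prefix="api_call"):
    """Hypothetical dialog-level term for a single agent turn.

    Returns 0.0 when neither utterance is an API call, +1.0 when the
    generated API call matches the reference exactly, and -1.0 when an
    API call is wrong, missing, or issued where the reference has none.
    """
    gen_is_call = bool(generated) and generated[0] == prefix
    ref_is_call = bool(reference) and reference[0] == prefix
    if not gen_is_call and not ref_is_call:
        return 0.0
    return 1.0 if generated == reference else -1.0


def combined_reward(generated, reference, api_weight=1.0):
    """Sum of the utterance-level and dialog-level terms (illustrative weighting)."""
    return unigram_precision(generated, reference) + api_weight * api_call_reward(
        generated, reference
    )


if __name__ == "__main__":
    ref = "api_call italian rome".split()
    print(combined_reward("api_call italian rome".split(), ref))
    print(combined_reward("api_call french rome".split(), ref))
```

In a policy-gradient setup, a scalar of this kind would weight the log-likelihood of the generated tokens; how the paper shapes and distributes the reward over tokens is not specified in this summary, so that step is left out here.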

Paper Structure

This paper contains 12 sections, 7 equations, 1 table, 1 algorithm.