End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning

Jason D. Williams, Geoffrey Zweig

TL;DR

Addresses the challenge of learning task-oriented dialog policies without hand-crafted state representations. Proposes an end-to-end framework in which an LSTM maps raw dialog history to actions, supported by an entity grounding module and by developer-provided, domain-specific software that supplies business rules and API access. The model supports both supervised learning (SL) and reinforcement learning (RL), including online retraining and action masking to enforce domain constraints. Experiments on a contact-call task show that SL yields a viable initial policy and that RL initialized from an SL-trained policy learns faster and performs more stably. This work demonstrates a practical route to deployable, data-efficient, end-to-end dialog controllers.

Abstract

This paper presents a model for end-to-end learning of task-oriented dialog systems. The main component of the model is a recurrent neural network (an LSTM), which maps from raw dialog history directly to a distribution over system actions. The LSTM automatically infers a representation of dialog history, which relieves the system developer of much of the manual feature engineering of dialog state. In addition, the developer can provide software that expresses business rules and provides access to programmatic APIs, enabling the LSTM to take actions in the real world on behalf of the user. The LSTM can be optimized using supervised learning (SL), where a domain expert provides example dialogs which the LSTM should imitate; or using reinforcement learning (RL), where the system improves by interacting directly with end users. Experiments show that SL and RL are complementary: SL alone can derive a reasonable initial policy from a small number of training dialogs; and starting RL optimization with a policy trained with SL substantially accelerates the learning rate of RL.
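The combination of a learned distribution over system actions with developer-provided business rules can be illustrated with action masking. The following is a minimal sketch, not the authors' implementation: it assumes the network produces one raw score per action and that the developer code supplies a 0/1 mask of currently allowed actions. The function name `masked_action_distribution` and the example values are hypothetical.

```python
import math


def masked_action_distribution(scores, mask):
    """Softmax over action scores with disallowed actions forced to probability 0.

    scores: one raw score per action (assumed to come from the network).
    mask:   1 (or True) for allowed actions, 0 (or False) for disallowed ones
            (assumed to come from developer-provided business rules).
    """
    if len(scores) != len(mask):
        raise ValueError("scores and mask must have the same length")
    allowed = [s for s, m in zip(scores, mask) if m]
    if not allowed:
        raise ValueError("mask must allow at least one action")
    top = max(allowed)  # subtract the largest allowed score for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: three action templates; the second is currently disallowed.
probs = masked_action_distribution([2.0, 5.0, 1.0], [1, 0, 1])
```

Because the mask is applied before normalizing, a disallowed action can never be selected, however high its raw score, and the remaining probabilities still sum to one.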

Paper Structure

This paper contains 13 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Operational loop. Green trapezoids refer to programmatic code provided by the software developer. The blue boxes indicate the recurrent neural network, with trainable parameters. The orange box performs entity extraction. The vertical bars in steps 4 and 8 are a feature vector and a distribution over template actions, respectively. See text for a complete description.
  • Figure 2: One of the 21 example dialogs used for supervised learning training. For space, the entity tags that appear in the user and system sides of the dialogs have been removed; for example, Call <name>Jason</name> is shown as Call Jason. See the appendix for additional examples.
  • Figure 3: Average accuracy of leave-one-out cross-fold validation. The $x$ axis shows the number of training dialogs used to train the LSTM. The $y$ axis shows average accuracy on the one held-out dialog, where green bars show average accuracy measured per turn, and blue bars show average accuracy per dialog. A dialog is considered accurate if it contains zero prediction errors.
  • Figure 4: ROC plot of the scores of the actions selected by the LSTM. False positive rate (FPR) is the number of incorrectly predicted actions above a threshold $r$ divided by the total number of incorrectly predicted actions; true positive rate (TPR) is the number of correctly predicted actions above the threshold $r$ divided by the total number of correctly predicted actions.
  • Figure 5: Task completion rate (TCR) mean and standard deviation for a policy initially trained with $N = (0, 1, 2, 5, 10)$ dialogs using supervised learning (SL), and then optimized with $0$ to $10,000$ dialogs using reinforcement learning (RL). Training and evaluation were done with the same stochastic simulated user. Each line shows the average of 10 runs, where the dialogs used in the SL training in each run were randomly sampled from the 21 example dialogs.
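The evaluation quantities defined in the captions of Figures 3 and 4 can be written out directly. The sketch below is illustrative only: the function names, input formats, and toy data are assumptions, and "above a threshold" is read as a strict comparison (`score > r`), which the caption does not specify.

```python
def turn_and_dialog_accuracy(dialogs):
    """Per-turn and per-dialog accuracy, following the Figure 3 caption.

    dialogs: list of dialogs, each a non-empty list of booleans
             (True means the action at that turn was predicted correctly).
    A dialog counts as accurate only if it contains zero prediction errors.
    """
    turns = [ok for dialog in dialogs for ok in dialog]
    if not turns:
        raise ValueError("need at least one turn")
    per_turn = sum(turns) / len(turns)
    per_dialog = sum(all(dialog) for dialog in dialogs) / len(dialogs)
    return per_turn, per_dialog


def roc_point(scored_predictions, r):
    """(FPR, TPR) at threshold r, following the Figure 4 caption.

    scored_predictions: list of (score, is_correct) pairs, one per selected action.
    """
    correct = [s for s, ok in scored_predictions if ok]
    incorrect = [s for s, ok in scored_predictions if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect prediction")
    tpr = sum(s > r for s in correct) / len(correct)      # strict '>' is an assumption
    fpr = sum(s > r for s in incorrect) / len(incorrect)
    return fpr, tpr


# Toy data, invented for illustration (not results from the paper).
toy_dialogs = [[True, True, True], [True, False], [True, True]]
per_turn, per_dialog = turn_and_dialog_accuracy(toy_dialogs)

toy_scores = [(0.9, True), (0.8, True), (0.4, True), (0.7, False), (0.3, False)]
fpr, tpr = roc_point(toy_scores, r=0.5)
```

Sweeping `r` over the observed scores and plotting the resulting (FPR, TPR) pairs gives a curve of the kind described for Figure 4. Per-dialog accuracy is never higher than per-turn accuracy on the same data, since a single wrong turn makes the whole dialog count as inaccurate.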