Improving Multi-Domain Task-Oriented Dialogue System with Offline Reinforcement Learning

Dharmendra Prajapat, Durga Toshniwal

TL;DR

This paper proposes a task-oriented dialogue (TOD) system built on a unified pre-trained language model, GPT-2, and optimized with supervised learning and offline reinforcement learning (RL). A non-differentiable reward function mitigates the exposure bias and token loss problems that arise from supervised fine-tuning alone.

Abstract

A task-oriented dialogue (TOD) system is designed to accomplish user-defined tasks through dialogue. TOD systems have progressed towards end-to-end modeling by leveraging pre-trained large language models. Fine-tuning a pre-trained language model using only supervised learning leads to the exposure bias and token loss problems, and it steers the model away from completing the user's task. To address these issues, we propose a TOD system that uses a unified pre-trained language model, GPT-2, as its base model. It is optimized using both supervised learning and reinforcement learning (RL). The issues above are mitigated using a non-differentiable reward function, calculated as the weighted sum of the success rate and BLEU evaluation metrics. Including both metrics in the reward guides the language model towards completing the user's task while keeping its responses coherent and fluent. Our model is obtained by fine-tuning the pre-trained model at the dialogue-session level, where each session comprises the user utterance, belief state, system act, and system response. Experimental results on MultiWOZ2.1 demonstrate that our model increases the inform rate by 1.60% and the success rate by 3.17% compared to the baseline.
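
The reward described in the abstract can be illustrated with a minimal sketch. The abstract states only that the reward is a weighted sum of the success rate and BLEU; the function name `dialogue_reward`, the default weights, and the assumption that both metrics are scaled to [0, 1] are hypothetical choices made here for illustration, not values taken from the paper.

```python
def dialogue_reward(
    success: float,
    bleu: float,
    success_weight: float = 0.5,  # hypothetical weight, not from the paper
    bleu_weight: float = 0.5,     # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of task success and BLEU for one generated dialogue.

    Both inputs are assumed to be pre-computed on decoded text and
    scaled to [0, 1]. Because they are measured on discrete model
    outputs, the result carries no gradient with respect to the
    model parameters.
    """
    if not (0.0 <= success <= 1.0 and 0.0 <= bleu <= 1.0):
        raise ValueError("success and bleu are expected to lie in [0, 1]")
    return success_weight * success + bleu_weight * bleu


if __name__ == "__main__":
    # A dialogue that completes the task but has low lexical overlap
    # with the reference response.
    print(dialogue_reward(success=1.0, bleu=0.2, success_weight=0.75, bleu_weight=0.25))
```

Since the reward is a plain scalar computed after decoding, it cannot be backpropagated through directly, which is why an RL objective is used to incorporate it alongside the supervised next-token loss.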

Paper Structure

This paper contains 20 sections, 8 equations, 5 figures, 6 tables, and 1 algorithm.

Figures (5)

  • Figure 1: A multi-domain dialogue between a user and a task-oriented dialogue system.
  • Figure 2: A dialogue session contains the user utterance, belief state, DB search results, system acts, and system response for every dialogue turn; each component is colored differently.
  • Figure 3: The proposed architecture is fine-tuned with supervised learning (next-token prediction) and reinforcement learning (shown as a reward model).
  • Figure 4: The proposed method and the baseline methods are trained on the complete training dataset and evaluated on the single-domain, multi-domain, and full-set splits. Results are shown for the MultiWOZ2.1 test dataset. All experiments are conducted in an end-to-end setting.
  • Figure 5: Performance of our system across dialogues with different numbers of turns on the MultiWOZ2.1 test dataset.
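
The session layout described in the Figure 2 caption can be sketched as a simple serialization step. This is an illustrative sketch only: the `Turn` dataclass, the `flatten_session` helper, and the `<name> ... </name>` segment markers are hypothetical, since the captions list the five components of each turn but not the special tokens used to delimit them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    user: str           # user utterance
    belief_state: str   # tracked slots and values
    db_result: str      # DB search result summary
    system_act: str     # system acts
    response: str       # system response


# Order of the components within a turn, as listed in the Figure 2 caption.
# The marker format below is a hypothetical stand-in for the real special tokens.
SEGMENTS = ("user", "belief_state", "db_result", "system_act", "response")


def flatten_session(turns: List[Turn]) -> str:
    """Concatenate every turn of a session into one training sequence."""
    parts = []
    for turn in turns:
        for name in SEGMENTS:
            parts.append(f"<{name}> {getattr(turn, name)} </{name}>")
    return " ".join(parts)


if __name__ == "__main__":
    session = [
        Turn(
            user="i need a cheap hotel",
            belief_state="hotel pricerange=cheap",
            db_result="matches=3",
            system_act="hotel request area",
            response="which area would you like to stay in ?",
        )
    ]
    print(flatten_session(session))
```

Flattening the whole session into one sequence lets a single language model be trained with next-token prediction over all components, with earlier turns serving as context for later ones.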