Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders

Tiancheng Zhao, Ran Zhao, Maxine Eskenazi

TL;DR

This paper addresses the dull, generic responses produced by open-domain neural dialog models by modeling discourse-level diversity with a conditional variational autoencoder (CVAE). It introduces a knowledge-guided CVAE (kgCVAE) that incorporates linguistic prior knowledge, together with a bag-of-words (BOW) auxiliary loss that stabilizes training and encourages the model to use the latent variable, enabling diverse yet coherent responses even with greedy decoding. Empirical results on Switchboard show that CVAE and kgCVAE outperform a strong baseline, with kgCVAE achieving strong precision and recall across multiple metrics and producing interpretable outputs such as predicted dialog acts. The work demonstrates the feasibility of capturing discourse-level variation in dialog generation and lays groundwork for data-driven dialog managers that leverage latent discourse factors.

Abstract

While recent neural encoder-decoder models have shown great promise in modeling open-domain conversations, they often generate dull and generic responses. Unlike past work that has focused on diversifying the output of the decoder at word-level to alleviate this problem, we present a novel framework based on conditional variational autoencoders that captures the discourse-level diversity in the encoder. Our model uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders. We have further developed a novel variant that is integrated with linguistic prior knowledge for better performance. Finally, the training procedure is improved by introducing a bag-of-word loss. Our proposed models have been validated to generate significantly more diverse responses than baseline approaches and exhibit competence in discourse-level decision-making.
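
The abstract names three ingredients without giving formulas: a latent variable sampled per response, a CVAE training objective, and a bag-of-word loss. The sketch below illustrates how these terms are commonly computed. It assumes diagonal Gaussian recognition and prior distributions, which the abstract does not state, and the function names (`sample_latent`, `gaussian_kl`, `bow_loss`) are hypothetical. It is not the authors' implementation.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians (assumed form of the
    recognition network q and the prior network p)."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
    return 0.5 * total


def bow_loss(vocab_logits, response_token_ids):
    """Negative log-likelihood of the response words under a single softmax
    over the vocabulary, ignoring word order."""
    peak = max(vocab_logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(f - peak) for f in vocab_logits))
    return -sum(vocab_logits[t] - log_norm for t in response_token_ids)


# Toy usage: a 2-dimensional latent and a 4-word vocabulary.
rng = random.Random(0)
z = sample_latent([0.0, 0.0], [0.0, 0.0], rng)
kl = gaussian_kl([1.0], [0.0], [0.0], [0.0])
bow = bow_loss([0.0, 0.0, 0.0, 0.0], [1, 3])
```

In a sketch like this, the KL term and the BOW term would be added to the decoder's reconstruction loss. Because the BOW logits would be predicted from the latent variable and the context rather than from previously decoded words, that term can only be reduced by putting information about the response into the latent variable, which is one way to read the abstract's claim that the loss improves training.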

Paper Structure

This paper contains 19 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Given A's question, there exist many valid responses from B for different assumptions of the latent variables, e.g., B's hobby.
  • Figure 2: Graphical models of CVAE (a) and kgCVAE (b)
  • Figure 3: The neural network architectures for the baseline and the proposed CVAE/kgCVAE models. $\bigoplus$ denotes the concatenation of the input vectors. The dashed blue connections only appear in kgCVAE.
  • Figure 4: BLEU-4 precision/recall vs. the number of distinct reference dialog acts.
  • Figure 5: t-SNE visualization of the posterior $z$ for test responses with the 8 most frequent dialog acts. The size of each circle represents the response length.
  • ...and 1 more figure