Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization

Jiaao Chen, Diyi Yang

TL;DR

This work tackles abstractive dialogue summarization by introducing a multi-view sequence-to-sequence model that leverages diverse conversational structures. It automatically extracts four views (topic, stage, global, discrete) and uses a multi-view attention mechanism on a BART-based encoder-decoder to generate summaries. On the SAMSum dataset, the model with topic and stage views consistently outperforms baselines in ROUGE scores and is favorably rated by human evaluators, with analysis highlighting the advantages of combining views and outlining remaining challenges such as missing information and incorrect reasoning. The approach provides a practical, scalable framework for structured dialogue understanding and summarization, with publicly available code for reproducibility and further research.

Abstract

Text summarization is one of the most challenging and interesting problems in NLP. Although much attention has been paid to summarizing structured text like news reports or encyclopedia articles, summarizing conversations---an essential part of human-human/machine interaction where most important pieces of information are scattered across various utterances of different speakers---remains relatively under-investigated. This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations and then utilizing a multi-view decoder to incorporate different views to generate dialogue summaries. Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment. We also discussed specific challenges that current approaches faced with this task. We have publicly released our code at https://github.com/GT-SALT/Multi-View-Seq2Seq.

Paper Structure

This paper contains 27 sections, 6 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Model architecture. Different views of conversations are first extracted automatically, then encoded through the conversation encoder (a) and combined in the multi-view decoder to generate summaries (b). In the conversation encoder, each view (a sequence of blocks) is encoded separately, and its block representations $S_i$ are passed through an LSTM to represent the view. In the multi-view decoder, the model decides attention weights over the different views and then attends to each token in each view through multi-view attention.
  • Figure 2: Allowed state transitions for the HMM conversation model. $S_i$ are conversation stages, $O_i$ are sentences' encoded representations. Conversation stages evolve in an increasing order from 1 to $n$.
  • Figure 3: Relations between ROUGE scores and the number of participants/turns in conversations.
  • Figure 4: Human evaluation results. The mean score for each model is also shown in the box plot.
  • Figure 5: Relations between difficulties in conversations and errors made by our model.
  • ...and 1 more figure
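
The multi-view attention summarized in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation (their released code builds on BART); it is a simplified, dependency-free illustration that assumes dot-product scoring, one summary vector per view, and a single decoder query vector. The function names `attend` and `multi_view_context` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, token_states):
    """Token-level attention inside one view; returns a context vector."""
    weights = softmax([dot(query, h) for h in token_states])
    dim = len(token_states[0])
    return [sum(w * h[i] for w, h in zip(weights, token_states))
            for i in range(dim)]


def multi_view_context(query, views, view_summaries):
    """Combine per-view contexts using view-level attention weights.

    views:          list of views, each a list of token state vectors
    view_summaries: one vector per view (assumed here to stand in for
                    the LSTM-encoded view representation)
    """
    # Step 1: decide how much weight each view receives.
    view_weights = softmax([dot(query, s) for s in view_summaries])
    # Step 2: attend to the tokens of each view separately.
    contexts = [attend(query, tokens) for tokens in views]
    # Step 3: mix the per-view contexts by the view-level weights.
    dim = len(query)
    combined = [sum(w * c[i] for w, c in zip(view_weights, contexts))
                for i in range(dim)]
    return view_weights, combined


if __name__ == "__main__":
    query = [0.0, 0.0]  # a zero query gives uniform attention everywhere
    views = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0]]]
    summaries = [[1.0, 0.0], [0.0, 1.0]]
    print(multi_view_context(query, views, summaries))
```

With a zero query, both levels of attention are uniform, so the combined context is the average of the two per-view averages. A trained model would instead learn the scoring functions so that the decoder can favor, for example, the topic view at one step and the stage view at another.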