DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization

Ming Zhong, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng

TL;DR

Long, multi-person dialogues are hard for standard NLP models, which are mostly built for short inputs. DialogLM pairs a window-based denoising pre-training task with a hybrid-attention Transformer so that the model learns dialogue structure and can process inputs of several thousand words. It is pre-trained on MediaSum and OpenSubtitles and evaluated on five benchmarks (AMI, ICSI, QMSum, ForeverDreaming, TVMegaSite), where it achieves state-of-the-art results in long-dialogue summarization, abstractive QA, and topic segmentation, with further gains from a sparse-attention variant and LED-based adaptation. Ablations and human evaluations support these results, and the authors release all models and code publicly. The target applications are large-scale analysis of meetings and screenplays.

Abstract

Dialogue is an essential part of human communication and cooperation. Existing research mainly focuses on short dialogue scenarios in a one-on-one fashion. However, multi-person interactions in the real world, such as meetings or interviews, are frequently over a few thousand words. There is still a lack of corresponding research and powerful tools to understand and process such long dialogues. Therefore, in this work, we present a pre-training framework for long dialogue understanding and summarization. Considering the nature of long conversations, we propose a window-based denoising approach for generative pre-training. For a dialogue, it corrupts a window of text with dialogue-inspired noise, and guides the model to reconstruct this window based on the content of the remaining conversation. Furthermore, to process longer input, we augment the model with sparse attention which is combined with conventional attention in a hybrid manner. We conduct extensive experiments on five datasets of long dialogues, covering tasks of dialogue summarization, abstractive question answering and topic segmentation. Experimentally, we show that our pre-trained model DialogLM significantly surpasses the state-of-the-art models across datasets and tasks. Source code and all the pre-trained models are available on our GitHub repository (https://github.com/microsoft/DialogLM).
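
The window-based denoising idea in the abstract can be illustrated with a minimal Python sketch. The abstract only says "dialogue-inspired noise", so the two noise types used here (speaker masking and turn shuffling), the `MASK` token, the `corrupt_window` helper, and the toy dialogue are all assumptions for illustration, not the authors' implementation.

```python
import random
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)
MASK = "[MASK_SPEAKER]"  # hypothetical placeholder token, not taken from the paper


def corrupt_window(turns: List[Turn], start: int, size: int,
                   rng: random.Random) -> Tuple[List[Turn], List[Turn]]:
    """Corrupt turns[start:start+size]; return (noisy dialogue, clean window)."""
    end = min(start + size, len(turns))
    target = turns[start:end]                     # what the model should reconstruct
    noisy = [(MASK, text) for _, text in target]  # assumed noise 1: hide speakers
    rng.shuffle(noisy)                            # assumed noise 2: permute turn order
    # Turns outside the window are left untouched and serve as context.
    return turns[:start] + noisy + turns[end:], target


dialogue = [
    ("A", "Shall we start with the budget?"),
    ("B", "Yes, the draft is ready."),
    ("C", "I have two concerns about it."),
    ("A", "Go ahead."),
    ("B", "Then we vote."),
]
noisy_dialogue, target_window = corrupt_window(dialogue, 1, 3, random.Random(0))
```

In a pre-training setup of this kind, `noisy_dialogue` would be the encoder input and `target_window` the sequence the decoder is trained to generate.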

Paper Structure

This paper contains 25 sections, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Pre-training task for DialogLM: window-based denoising. We first select a window containing multiple turns and inject different dialogue-inspired noises into it. We then train the model to restore this window based on the noisy window and the rest of the dialogue.
  • Figure 2: Model architecture for DialogLM. We introduce a hybrid attention approach in Transformer architecture: most layers are equipped with a sparse attention method (Sinkhorn attention) and the rest retain global self-attention.
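
The hybrid layout described in the Figure 2 caption can be sketched as a per-layer assignment of attention types, with a rough count of attention scores to show why mixing the two helps on long inputs. The layer count, the choice of which layers stay global, the block size, and the cost model (each token in a sparse layer attends to its own block plus one other block) are assumptions for illustration; the caption does not specify them.

```python
from typing import List, Set


def hybrid_layout(num_layers: int, global_layers: Set[int]) -> List[str]:
    """Label each layer 'global' (full self-attention) or 'sparse'."""
    bad = sorted(i for i in global_layers if not 0 <= i < num_layers)
    if bad:
        raise ValueError(f"layer indices out of range: {bad}")
    return ["global" if i in global_layers else "sparse" for i in range(num_layers)]


def attention_scores(layout: List[str], seq_len: int, block: int) -> int:
    """Rough number of query-key scores computed across all layers."""
    full = seq_len * seq_len                    # every token attends to every token
    sparse = seq_len * min(2 * block, seq_len)  # assumed: own block + one other block
    return sum(full if kind == "global" else sparse for kind in layout)


# Hypothetical configuration: 12 layers, three of which keep global attention.
layout = hybrid_layout(12, {3, 7, 11})
hybrid_cost = attention_scores(layout, seq_len=4096, block=256)
full_cost = attention_scores(["global"] * 12, seq_len=4096, block=256)
```

Comparing `hybrid_cost` with `full_cost` gives a feel for the savings under these assumptions, while the few global layers still let every token see the whole dialogue.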