SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization

Sangwoo Cho, Kaiqiang Song, Chao Zhao, Xiaoyang Wang, Dong Yu

TL;DR

SPECTRUM tackles long-dialogue summarization by injecting speaker-turn signals into a two-stage pre-training regime for encoder–decoder models operating with sparse attention. It considers three objectives: Sentence Generation, Speaker Turn Prediction, and a masked language modeling (MLM) component that is dropped in practice. The two retained objectives are combined as $\mathcal{L}(\Theta) = \mathcal{L}_{gen} + \beta\mathcal{L}_{turn}$, where $\beta$ weights the turn-prediction loss, to leverage the structure of multi-turn dialogues. Through curated, diverse pre-training data and careful alignment of data-length distributions across stages, SPECTRUM achieves state-of-the-art or competitive ROUGE scores on long meeting and TV-show datasets, outperforming existing baselines and, in some cases, even larger models. The work demonstrates the importance of dataset composition and targeted objectives for long-context dialogue understanding and offers practical guidance for pre-training strategies in this domain.
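
The combined objective above can be illustrated with a small sketch. It assumes a token-level cross-entropy for $\mathcal{L}_{gen}$ and a binary cross-entropy for $\mathcal{L}_{turn}$; this summary does not state the exact loss forms, and the function names and the value of `beta` are hypothetical choices made only for illustration.

```python
import math


def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the target ids.

    step_probs[i] is a probability distribution over the vocabulary at step i.
    Assumed form of L_gen; the summary does not specify it.
    """
    nll = [-math.log(dist[t]) for dist, t in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)


def binary_cross_entropy(switch_probs, switch_labels):
    """Mean binary cross-entropy for turn-switch predictions.

    switch_probs[i] is the predicted probability that a speaker switch occurs
    at position i. Assumed form of L_turn.
    """
    terms = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(switch_probs, switch_labels)
    ]
    return sum(terms) / len(terms)


def combined_loss(step_probs, target_ids, switch_probs, switch_labels, beta=1.0):
    """L(Theta) = L_gen + beta * L_turn, with beta a hypothetical weight."""
    l_gen = cross_entropy(step_probs, target_ids)
    l_turn = binary_cross_entropy(switch_probs, switch_labels)
    return l_gen + beta * l_turn
```

Setting `beta` to 0 reduces this to the generation loss alone, which makes the contribution of the speaker-turn term easy to isolate in an ablation.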

Abstract

Multi-turn dialogues are characterized by their extended length and the presence of turn-taking conversations. Traditional language models often overlook the distinct features of these dialogues by treating them as regular text. In this paper, we propose a speaker-enhanced pre-training method for long dialogue summarization, which leverages the inherent structure of multi-turn dialogues. To support our study, we curate a diverse dataset that includes transcripts from real-world scenarios, movie or TV show transcripts, and dialogues generated by a Large Language Model. We then perform pre-training that encompasses the detection of speaker changes and masked utterance generation. Experimental results of fine-tuned models demonstrate that our model achieves state-of-the-art performance on downstream benchmarks with long context, surpassing baseline models and demonstrating the effectiveness of our approach. Our findings highlight the importance of curating pre-training datasets that exhibit diversity and variations in length distribution to ensure effective alignment with downstream datasets.

Paper Structure

This paper contains 22 sections, 4 equations, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: The proposed approach implements a speaker-enhanced learning objective within an encoder-decoder model. Two distinct forward paths are incorporated to enhance dialogue understanding. The first path involves only the encoder component, which predicts speaker turn switches. The second path integrates both the encoder and decoder components to generate masked sentences within dialogues, further improving dialogue comprehension.
  • Figure 2: F1, precision, and recall scores for turn-switch prediction. Each panel shows performance on the corresponding stage-2 (S2) pre-training data, with ('-ckpt') or without parameter initialization from the model pre-trained in stage 1 (S1).
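
The turn-switch task and the metrics named in the captions above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes one binary label per utterance (1 when the speaker differs from the previous utterance's speaker, 0 otherwise, with the first utterance labeled 0), and the helper names are hypothetical.

```python
def turn_switch_labels(speakers):
    """Return one 0/1 label per utterance: 1 if the speaker changed.

    Assumed labeling scheme; the first utterance has no predecessor and is
    labeled 0 here.
    """
    if not speakers:
        return []
    return [0] + [int(prev != cur) for prev, cur in zip(speakers, speakers[1:])]


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 for the positive (switch) class."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: speaker B takes over at the third utterance, A returns at the fourth.
gold = turn_switch_labels(["A", "A", "B", "A"])
scores = precision_recall_f1([0, 1, 1, 0], gold)
```

Scoring only the positive class keeps the metric focused on detected switches, which matters when most adjacent utterances share a speaker.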