SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization
Sangwoo Cho, Kaiqiang Song, Chao Zhao, Xiaoyang Wang, Dong Yu
TL;DR
SPECTRUM tackles long-dialogue summarization by injecting speaker-turn signals into a two-stage pre-training regime for encoder–decoder models operating with sparse attention. It considers three objectives: Sentence Generation, Speaker Turn Prediction, and a masked language modeling (MLM) component that is dropped in practice. The resulting objective is $\mathcal{L}(\Theta) = \mathcal{L}_{gen} + \beta\mathcal{L}_{turn}$, where $\beta$ weights the turn-prediction term, so that pre-training exploits the structure of multi-turn dialogues. With curated, diverse pre-training data and careful alignment of data-length distributions across stages, SPECTRUM achieves state-of-the-art or competitive ROUGE scores on long meeting and TV-show datasets, outperforming existing baselines and, in some cases, larger models. The work demonstrates the importance of dataset composition and targeted objectives for long-context dialogue understanding and offers practical guidance for pre-training strategies in this domain.
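The combined objective above can be illustrated with a minimal sketch. It assumes that $\mathcal{L}_{gen}$ is a mean negative log-likelihood over the reference tokens of the generated sentence and that $\mathcal{L}_{turn}$ is a mean binary cross-entropy over per-position speaker-change predictions; these loss forms, the helper names, and the value of $\beta$ are illustrative assumptions and are not specified in this summary.

```python
import math


def token_nll(target_probs):
    """Mean negative log-likelihood of the reference tokens.

    `target_probs` holds the probability the model assigned to each
    reference token (assumed form of L_gen).
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy for speaker-change predictions.

    `probs[i]` is the predicted probability of a speaker change at
    position i; `labels[i]` is 1 for a change, 0 otherwise
    (assumed form of L_turn).
    """
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def combined_loss(loss_gen, loss_turn, beta):
    """L(Theta) = L_gen + beta * L_turn."""
    return loss_gen + beta * loss_turn


# Hypothetical toy values, for illustration only.
loss_gen = token_nll([0.9, 0.7, 0.8])
loss_turn = binary_cross_entropy([0.8, 0.1, 0.6], [1, 0, 1])
loss = combined_loss(loss_gen, loss_turn, beta=0.5)  # beta is an assumed value
print(f"L_gen={loss_gen:.4f}  L_turn={loss_turn:.4f}  L={loss:.4f}")
```

In this sketch, a larger $\beta$ places more weight on speaker-change detection relative to sentence generation; setting it to zero reduces the objective to generation alone.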
Abstract
Multi-turn dialogues are characterized by their extended length and turn-taking structure. Traditional language models often overlook these distinct features by treating dialogues as regular text. In this paper, we propose a speaker-enhanced pre-training method for long dialogue summarization that leverages the inherent structure of multi-turn dialogues. To support our study, we curate a diverse dataset that includes transcripts from real-world scenarios, movie and TV show transcripts, and dialogues generated by a Large Language Model. We then perform pre-training that encompasses speaker-change detection and masked utterance generation. Experimental results with fine-tuned models demonstrate that our model achieves state-of-the-art performance on downstream long-context benchmarks, surpassing baseline models and confirming the effectiveness of our approach. Our findings underscore the importance of curating pre-training datasets that are diverse and vary in length distribution, so that they align effectively with downstream datasets.
