Advancing Multi-Party Dialogue Framework with Speaker-ware Contrastive Learning
Zhongtian Hu, Qi He, Ronghan Li, Meng Zhao, Lifang Wang
TL;DR
This paper tackles multi-party dialogue generation, where traditional graph-based methods rely on annotated graph structures and struggle to model consistent speaker styles. It introduces CMR, a two-stage contrastive learning framework that first learns speaker-discriminative utterance representations (Stage I) and then jointly optimizes response generation with a contrastive objective to capture discourse themes (Stage II). In experiments on Friends and Ubuntu with T5 and LLaMA-3.1 backbones, CMR achieves strong results in both automatic and human evaluations, and an LLM-based preference study shows a clear edge for CMR-equipped models. The approach eliminates the need for annotated graphs, is robust to varying speaker counts, and integrates well with large pre-trained models, offering a scalable path for improving multi-party dialogue systems.
Abstract
Multi-party dialogues, common in collaborative scenarios like brainstorming sessions and negotiations, pose significant challenges due to their complexity and diverse speaker roles. Current methods often use graph neural networks to model dialogue context, capturing structural dynamics but relying heavily on annotated graph structures and overlooking individual speaking styles. To address these challenges, we propose CMR, a Contrastive learning-based Multi-party dialogue Response generation framework. CMR employs a two-stage self-supervised contrastive learning strategy. First, it captures global differences in speaking styles across individuals. Then, it focuses on intra-conversation comparisons to identify thematic transitions and contextually relevant facts. To the best of our knowledge, this is the first approach that applies contrastive learning to multi-party dialogue generation. Experimental results demonstrate that CMR not only significantly outperforms state-of-the-art models, but also generalizes well to large pre-trained language models, effectively enhancing their capability in handling multi-party conversations.
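To make the first stage more concrete, the following is a minimal sketch of a speaker-aware contrastive objective, not the authors' implementation. It assumes a supervised-contrastive (InfoNCE-style) loss over cosine similarities in which utterances from the same speaker are positives and all other utterances are negatives. The function names, the temperature value, and the toy two-dimensional embeddings are hypothetical; the exact loss used by CMR is not specified in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def speaker_contrastive_loss(embeddings, speakers, temperature=0.1):
    """Illustrative speaker-aware contrastive loss (an assumption, not CMR's exact loss).

    For each anchor utterance, utterances by the same speaker are positives
    and every other utterance is a negative. Anchors without any positive
    are skipped; the result is averaged over the remaining anchors.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and speakers[j] == speakers[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other utterance.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Mean negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy utterance embeddings: the first two are close, the last two are close.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = speaker_contrastive_loss(toy_embeddings, ["A", "A", "B", "B"])
misaligned = speaker_contrastive_loss(toy_embeddings, ["A", "B", "A", "B"])
print(f"aligned={aligned:.4f} misaligned={misaligned:.4f}")
```

In this toy setup the loss is lower when nearby embeddings share a speaker label than when they do not, which is the property a speaker-discriminative encoder would be trained toward. A second-stage objective could in principle reuse the same form with positives drawn from within one conversation, but that is a guess about the design and is not confirmed by this summary.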
