Conversation Disentanglement with Bi-Level Contrastive Learning

Chengyu Huang, Zheng Zhang, Hao Fei, Lizi Liao

TL;DR

This work tackles conversation disentanglement by addressing both local utterance relations and global session structure. It introduces Bi-CL, a bi-level contrastive learning framework that jointly optimizes utterance-level and session-level objectives, augmented by session prototypes and a clustering-aligned training loop. The method supports both supervised and unsupervised settings, employing a learnable K predictor, K-Means clustering, and Hungarian alignment to produce coherent session partitions, with an EM-like extension for unlabeled data. Empirical results on Ubuntu IRC and Movie Dialogue datasets demonstrate state-of-the-art performance and robustness across settings, with ablations confirming the critical roles of both contrastive losses and centroid alignment for effective disentanglement.

Abstract

Conversation disentanglement aims to group utterances into detached sessions, a fundamental task in processing multi-party conversations. Existing methods have two main drawbacks. First, they overemphasize pairwise utterance relations and pay inadequate attention to modeling the utterance-to-context relation. Second, they require a huge amount of human-annotated data for training, which is expensive to obtain in practice. To address these issues, we propose a general disentanglement model based on bi-level contrastive learning. It brings utterances in the same session closer together while encouraging each utterance to be near its clustered session prototypes in the representation space. Unlike existing approaches, our disentanglement model works in both the supervised setting with labeled data and the unsupervised setting when no such data is available. The proposed method achieves new state-of-the-art performance in both settings across several public datasets.
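
To make the two levels concrete, the following is a minimal sketch of a bi-level contrastive objective in plain Python. It is not the paper's formulation, which is not reproduced in this summary. It assumes a generic supervised-contrastive (InfoNCE-style) loss at the utterance level and, at the session level, prototypes computed as the normalized mean of each session's embeddings. The function names, the temperature `tau`, and the weight `lam` are all hypothetical.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    # Unit-normalize; a zero vector is left as-is to avoid division by zero.
    norm = math.sqrt(_dot(v, v)) or 1.0
    return [a / norm for a in v]

def _log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def utterance_loss(embeddings, labels, tau=0.5):
    """Pull each utterance toward the other utterances of its session."""
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor alone in its session contributes nothing
        scores = {j: _dot(z[i], z[j]) / tau for j in others}
        log_denom = _log_sum_exp(list(scores.values()))
        losses.append(-sum(scores[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

def session_loss(embeddings, labels, tau=0.5):
    """Pull each utterance toward its own session prototype (assumed: mean embedding)."""
    z = [_normalize(e) for e in embeddings]
    sessions = sorted(set(labels))
    prototypes = {}
    for s in sessions:
        members = [z[i] for i in range(len(z)) if labels[i] == s]
        mean = [sum(col) / len(members) for col in zip(*members)]
        prototypes[s] = _normalize(mean)
    losses = []
    for i, zi in enumerate(z):
        scores = {s: _dot(zi, prototypes[s]) / tau for s in sessions}
        losses.append(_log_sum_exp(list(scores.values())) - scores[labels[i]])
    return sum(losses) / len(losses)

def bi_level_loss(embeddings, labels, tau=0.5, lam=1.0):
    # Hypothetical combination: a weighted sum of the two levels.
    return utterance_loss(embeddings, labels, tau) + lam * session_loss(embeddings, labels, tau)

# Toy 2-D "utterance embeddings": two tight groups.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matching = [0, 0, 1, 1]     # session labels that agree with the geometry
mismatched = [0, 1, 0, 1]   # session labels that cut across the groups
print(bi_level_loss(emb, matching), bi_level_loss(emb, mismatched))
```

On this toy input, the labeling that agrees with the embedding geometry yields the lower loss, which is the behavior such an objective is meant to reward. In an unsupervised setting, the labels would have to come from cluster assignments rather than annotations.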

Paper Structure (26 sections, 20 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: An example piece of conversation from the Ubuntu IRC corpus. There are distribution patterns at both the utterance level and the session level.
  • Figure 2: Overview of the proposed Bi-CL framework. It incorporates an utterance-level contrastive loss to discriminate utterances, and uses a session-level contrastive loss to encourage them to flock around session centers.