Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances

Hanlei Zhang, Hua Xu, Fei Long, Xin Wang, Kai Gao

TL;DR

The paper tackles unsupervised semantics discovery for multimodal utterances by introducing UMC, a method that jointly leverages text, video, and audio through a new augmentation scheme and high-quality sample selection. It builds multimodal representations using the text modality as an anchor and nonverbal masking as augmentation, then selects high-quality samples by density and trains iteratively, alternating supervised and unsupervised contrastive objectives. Key contributions include a density-based curriculum for sample quality, automatic per-cluster determination of the top-$K$ nearest-neighbor parameter, and centroid inheritance to improve clustering quality. Empirical results on MIntRec, MELD-DA, and IEMOCAP-DA show consistent 2-6% improvements in clustering metrics over state-of-the-art baselines, underscoring the value of nonverbal cues for unsupervised multimodal semantics discovery and its potential in real-world human-machine interactions.

Abstract

Discovering the semantics of multimodal utterances is essential for understanding human language and enhancing human-machine interactions. Existing methods manifest limitations in leveraging nonverbal information for discerning complex semantics in unsupervised scenarios. This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field. UMC introduces a unique approach to constructing augmentation views for multimodal data, which are then used to perform pre-training to establish well-initialized representations for subsequent clustering. An innovative strategy is proposed to dynamically select high-quality samples as guidance for representation learning, gauged by the density of each sample's nearest neighbors. Besides, it is equipped to automatically determine the optimal value for the top-$K$ parameter in each cluster to refine sample selection. Finally, both high- and low-quality samples are used to learn representations conducive to effective clustering. We build baselines on benchmark multimodal intent and dialogue act datasets. UMC shows remarkable improvements of 2-6% in clustering metrics over state-of-the-art methods, marking the first successful endeavor in this domain. The complete code and data are available at https://github.com/thuiar/UMC.
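
The density-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that a sample's density is the reciprocal of its mean distance to its nearest neighbors within the same cluster, and it uses a fixed `k_near` and a fixed `keep_ratio` rather than the automatic per-cluster top-$K$ determination the paper proposes. The function and parameter names are hypothetical.

```python
import numpy as np


def knn_density(points: np.ndarray, k_near: int) -> np.ndarray:
    """Assumed density: reciprocal of the mean distance to the k nearest neighbors."""
    n = len(points)
    k = min(k_near, n - 1)
    if k < 1:
        # A single-sample cluster has no neighbors; give it a neutral density.
        return np.ones(n)
    # Pairwise Euclidean distances within the cluster.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude each sample from its own neighbors
    nearest = np.sort(dists, axis=1)[:, :k]
    return 1.0 / (nearest.mean(axis=1) + 1e-12)


def select_high_quality(
    embeddings: np.ndarray,
    labels: np.ndarray,
    k_near: int = 5,
    keep_ratio: float = 0.5,
) -> np.ndarray:
    """Return sorted indices of the densest samples in each cluster.

    k_near and keep_ratio are fixed here for illustration only.
    """
    selected = []
    for cluster_id in np.unique(labels):
        idx = np.flatnonzero(labels == cluster_id)
        density = knn_density(embeddings[idx], k_near)
        n_keep = max(1, int(round(keep_ratio * len(idx))))
        top = np.argsort(-density)[:n_keep]  # highest density first
        selected.append(idx[top])
    return np.sort(np.concatenate(selected))
```

In this sketch, the returned indices would play the role of the high-quality samples that guide representation learning, while the remaining samples would be treated as low-quality.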

Paper Structure (37 sections, 14 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Text-only clustering deviates from real multimodal utterance semantics, highlighting the need for multimodal information in semantics discovery.
  • Figure 2: Overview of our proposed unsupervised multimodal clustering algorithm UMC.
  • Figure 3: Pipeline of the high-quality sample selection mechanism.
  • Figure 4: Automatic vs. fixed $K_{\text{near}}$ selection strategy.
  • Figure 5: Visualization of representations on the IEMOCAP-DA dataset.
  • ...and 5 more figures