Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances

Hanlei Zhang; Hua Xu; Fei Long; Xin Wang; Kai Gao

TL;DR

The paper tackles unsupervised semantics discovery for multimodal utterances by introducing UMC, a method that jointly leverages text, video, and audio through a new augmentation scheme and high-quality sample selection. It builds multimodal representations around a text-centric anchor with nonverbal masking augmentations, then applies density-driven selection of high-quality samples and an iterative learning scheme that alternates supervised and unsupervised contrastive objectives. Key contributions include a density-based curriculum for sample quality, automatic per-cluster determination of the top-k neighbors, and centroid inheritance to improve clustering quality. Empirical results on MIntRec, MELD-DA, and IEMOCAP-DA show consistent 2-6% improvements in clustering metrics over state-of-the-art baselines, underscoring the value of nonverbal cues for unsupervised multimodal semantics discovery and its potential in real-world human-machine interactions.
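To make the two ingredients named above more concrete, here is a minimal sketch of masking-based views and an unsupervised contrastive (InfoNCE-style) loss. It is an illustration under stated assumptions, not the authors' implementation: `make_views`, `cosine`, and `info_nce` are hypothetical names, plain concatenation stands in for the fusion network (which this page does not describe), and the temperature value is arbitrary.

```python
import math

def make_views(text, video, audio):
    """Build three views of one utterance: full, video-masked, audio-masked.
    Text is kept in every view. Concatenation is an assumed stand-in for
    the real multimodal fusion network."""
    full = text + video + audio
    no_video = text + [0.0] * len(video) + audio
    no_audio = text + video + [0.0] * len(audio)
    return full, no_video, no_audio

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(anchors, positives, temperature=0.5):
    """Mean InfoNCE loss. positives[i] is the positive for anchors[i];
    every other row of `positives` acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

In this sketch the loss is small when each anchor is most similar to its own augmented view and grows when the pairing is wrong, which is the property a contrastive pre-training stage relies on.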

Abstract

Discovering the semantics of multimodal utterances is essential for understanding human language and enhancing human-machine interactions. Existing methods manifest limitations in leveraging nonverbal information for discerning complex semantics in unsupervised scenarios. This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field. UMC introduces a unique approach to constructing augmentation views for multimodal data, which are then used to perform pre-training to establish well-initialized representations for subsequent clustering. An innovative strategy is proposed to dynamically select high-quality samples as guidance for representation learning, gauged by the density of each sample's nearest neighbors. In addition, it automatically determines the optimal value of the top-$K$ parameter in each cluster to refine sample selection. Finally, both high- and low-quality samples are used to learn representations conducive to effective clustering. We build baselines on benchmark multimodal intent and dialogue act datasets. UMC shows remarkable improvements of 2-6% in clustering metrics over state-of-the-art methods, marking the first successful endeavor in this domain. The complete code and data are available at https://github.com/thuiar/UMC.
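The density-based selection step can be sketched as follows. This is a hedged illustration only: it assumes density is the reciprocal of the mean Euclidean distance to a sample's `k` nearest neighbors inside its own cluster, and it takes `k` and the keep `ratio` as fixed inputs, whereas the abstract states that UMC determines the top-$K$ value per cluster automatically. The function names `knn_density` and `select_high_quality` are hypothetical.

```python
import math

def knn_density(points, k):
    """One density score per point: reciprocal of the mean Euclidean
    distance to its k nearest neighbors (the point itself excluded).
    This density definition is an assumption made for illustration."""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        densities.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))
    return densities

def select_high_quality(points, labels, k, ratio):
    """Within each cluster, keep the `ratio` fraction of points with the
    highest density. Returns sorted indices into `points`."""
    selected = []
    for cluster in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        if len(members) < 2:
            selected.extend(members)  # nothing to rank in a singleton cluster
            continue
        k_eff = min(k, len(members) - 1)
        dens = knn_density([points[i] for i in members], k_eff)
        n_keep = max(1, int(ratio * len(members)))
        ranked = sorted(range(len(members)), key=lambda t: dens[t], reverse=True)
        selected.extend(members[t] for t in ranked[:n_keep])
    return sorted(selected)

# Two toy clusters, each with three tight points and one outlier.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5),
          (10, 10), (10.1, 10), (10, 10.1), (20, 20)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
high_quality = select_high_quality(points, labels, k=2, ratio=0.75)
```

With these toy inputs the outliers at indices 3 and 7 have the lowest density in their clusters and are left out of `high_quality`; in a full pipeline the retained samples would guide representation learning while the rest are handled separately.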

Paper Structure (37 sections, 14 equations, 10 figures, 5 tables)

Introduction
Related Works
Unsupervised Clustering
Intent Discovery
Problem Formulation
Methodologies
Multimodal Representation
Multimodal Unsupervised Pre-training
Clustering and High-Quality Sample Selection
Density Calculation
High-Quality Sample Selection and Evaluation
Multimodal Representation Learning
Experiments
Datasets
Baselines
...and 22 more sections

Figures (10)

Figure 1: Text-only clustering deviates from real multimodal utterance semantics, highlighting the need for multimodal information in semantics discovery.
Figure 2: Overview of our proposed unsupervised multimodal clustering algorithm UMC.
Figure 3: Pipeline of the high-quality sample selection mechanism.
Figure 4: Automatic vs. fixed $K_{\text{near}}$ selection strategy.
Figure 5: Visualization of representations on the IEMOCAP-DA dataset.
...and 5 more figures
