TopicDiff: A Topic-enriched Diffusion Approach for Multimodal Conversational Emotion Detection

Jiamin Luo, Jingjing Wang, Guodong Zhou

TL;DR

A model-agnostic Topic-enriched Diffusion (TopicDiff) approach for capturing multimodal topic information in Multimodal Conversational Emotion (MCE) detection tasks. It integrates a diffusion model into a neural topic model to alleviate the neural topic model's diversity deficiency in capturing topic information.

Abstract

Multimodal Conversational Emotion (MCE) detection, which generally spans the acoustic, vision, and language modalities, has attracted increasing interest in the multimedia community. Previous studies predominantly focus on learning contextual information in conversations; only a few consider topic information, and those only in the language modality, neglecting acoustic and vision topic information. To address this, we propose a model-agnostic Topic-enriched Diffusion (TopicDiff) approach for capturing multimodal topic information in MCE tasks. In particular, we integrate a diffusion model into a neural topic model to alleviate the neural topic model's diversity deficiency in capturing topic information. Detailed evaluations demonstrate significant improvements of TopicDiff over state-of-the-art MCE baselines, supporting both the importance of multimodal topic information to MCE and the effectiveness of TopicDiff in capturing it. Furthermore, we observe that topic information in the acoustic and vision modalities is more discriminative and robust than topic information in the language modality.
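
The abstract does not spell out how the diffusion model is combined with the neural topic model, so the following is only a minimal sketch of the generic forward (noising) step used by standard denoising diffusion models, applied to a toy topic vector. The linear noise schedule, the function names (`linear_beta_schedule`, `cumulative_alphas`, `forward_noise`), and the example vector are all illustrative assumptions, not the paper's formulation.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule: one variance beta_t per step."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in x0
    ]


# Toy "topic vector" (assumed for illustration only).
topic_vector = [0.7, 0.2, 0.1]
alpha_bars = cumulative_alphas(linear_beta_schedule(100))
noisy = forward_noise(topic_vector, t=99, alpha_bars=alpha_bars, rng=random.Random(0))
```

A denoising network would then be trained to reverse this corruption; in a topic-modeling setting, the stochasticity of that reverse process is one plausible source of the added diversity the abstract refers to.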

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: A multimodal conversational example from the MELD dataset illustrating the importance of multimodal topic information, where each utterance contains an acoustic spectrum, a video frame, language, and the corresponding emotion label.
  • Figure 2: The overall architecture of our model-agnostic Topic-enriched Diffusion (TopicDiff) approach for MCE, where TDB denotes the Topic-enriched Diffusion Block, which consists of a Topic-enriched Diffusion Process and a Topic-enriched Denoising Process.
  • Figure 3: Four line charts studying the robustness of our TopicDiff approach and of each modality's topic information as the topic-density degree changes, obtained by keeping the total size of the training set fixed and varying the number of TV series (x-axis). Charts (a) and (b) show the performance trend of two MCE approaches with and without TopicDiff. Charts (c) and (d) show the performance trend for acoustic, vision, and language topic information using the two MCE approaches with TopicDiff. All results are obtained on our constructed topic-density M3ED$^*$ dataset and evaluated with the W-F1 metric.
  • Figure 4: A multimodal conversational sample whose utterances comprise acoustic spectra, video frames, language, and ground-truth emotions, alongside the probabilities of the ground-truth emotion joy on V4 predicted by various approaches. Language/Acoustic/Vision Topic denotes the use of TopicDiff to capture the corresponding modal topic information.
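
The Figure 3 caption reports results with the W-F1 metric. Assuming W-F1 denotes the usual support-weighted average of per-class F1 scores (the caption does not define it), a minimal sketch of the computation looks like this. The function name `weighted_f1` and the toy labels are illustrative, not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += support[label] / total * f1
    return score


# Toy emotion labels (assumed for illustration only).
gold = ["joy", "joy", "sad", "neutral"]
pred = ["joy", "sad", "sad", "neutral"]
print(weighted_f1(gold, pred))
```

Weighting by class support keeps frequent emotions from being drowned out by rare ones, which matters on conversational datasets where the emotion distribution is typically imbalanced.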