End-to-end Semantic-centric Video-based Multimodal Affective Computing

Ronghao Lin, Ying Zeng, Sijie Mai, Haifeng Hu

TL;DR

SemanticMAC addresses end-to-end multimodal affective computing by learning semantic-centric representations across the textual, acoustic, and visual modalities. It introduces an Affective Perceiver for unimodal refinement and three semantic-centric modules (SGFI, SCLG, SCCL) that produce modality-specific and shared semantics, guided by pseudo labels and contrastive losses. The approach reports state-of-the-art results on seven public datasets spanning sentiment analysis, emotion recognition, and humor/sarcasm detection, while mitigating semantic imbalance and mismatch without relying on handcrafted features. The framework is robust to varying video lengths and generalizes across language models, which suggests practical value for real-world, end-to-end MAC systems.

Abstract

On the path toward Artificial General Intelligence (AGI), understanding human affection is essential to enhancing machines' cognitive abilities. To achieve more sensitive human-AI interaction, Multimodal Affective Computing (MAC) in human-spoken videos has attracted increasing attention. However, previous methods mainly focus on designing multimodal fusion algorithms and suffer from two issues: semantic imbalance, caused by diverse pre-processing operations, and semantic mismatch, caused by affective content in individual modalities that is inconsistent with the multimodal ground truth. In addition, their reliance on manual feature extractors prevents them from building an end-to-end pipeline for multiple MAC downstream tasks. To address these challenges, we propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affection for human-spoken videos. We first employ a pre-trained Transformer model in multimodal data pre-processing and design an Affective Perceiver module to capture unimodal affective information. Moreover, we present a semantic-centric approach to unify multimodal representation learning in three ways: gated feature interaction, multi-task pseudo label generation, and intra-/inter-sample contrastive learning. Finally, SemanticMAC effectively learns specific- and shared-semantic representations under the guidance of semantic-centric labels. Extensive experimental results demonstrate that our approach surpasses state-of-the-art methods on 7 public datasets in four MAC downstream tasks.
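The abstract names inter-sample contrastive learning without giving its formula. As a rough illustration of the general idea only, the sketch below implements a generic InfoNCE-style loss in plain Python: the representation of sample i in one modality is pulled toward the representation of the same sample in another modality, and the other samples in the batch act as negatives. This is not the paper's SCCL formulation; the function names, the temperature value, and the toy "text" and "audio" vectors are all assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical helper, not the paper's loss).

    anchors[i] and positives[i] form the matching pair; every other
    row of `positives` serves as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matching pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy data (assumed): 8 samples with 16-dimensional features.
random.seed(0)
text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# "Aligned" audio features are small perturbations of the text features.
audio_aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in text]
# Reversing the batch order pairs every sample with a different sample.
audio_mismatched = audio_aligned[::-1]

loss_aligned = info_nce(text, audio_aligned)
loss_mismatched = info_nce(text, audio_mismatched)
print(loss_aligned, loss_mismatched)
```

In this toy setup the loss is lower when each sample is paired with its own counterpart than when the pairs are scrambled, which is the signal a contrastive objective uses to align representations.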

Paper Structure (30 sections, 18 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: The two main challenges in conducting multimodal affective computing from a semantic perspective.
  • Figure 2: The PR curves of the fused multimodal representations and the unimodal representations (text, audio, and vision) of state-of-the-art models trained with GloVe (Pennington et al., 2014) and BERT (Devlin et al., 2019) features on the CMU-MOSEI dataset. Note that this PR curve was originally proposed by Sajjadi et al. (2018) as an evaluation metric for generative models, to characterize the relative probability densities of the distributions of real and generated data.
  • Figure 3: The overall architecture of the proposed SemanticMAC. Note that the frame embeddings and modality embeddings are updated during training and are then fixed and reused during downstream inference.
  • Figure 4: The designed Affective Perceiver to learn affective unimodal features of acoustic and visual modalities.
  • Figure 5: The proposed Semantic-centric Gated Feature Interaction module.
  • ...and 1 more figures