Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

Dingkang Yang; Mingcheng Li; Linhao Qu; Kun Yang; Peng Zhai; Song Wang; Lihua Zhang

TL;DR

A Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA) is proposed to refine multimodal features and leverage the complementarity across distinct modalities.

Abstract

Understanding human intentions (e.g., emotions) from videos has received considerable attention recently. Video streams generally blend temporal data from distinct modalities, including natural language, facial expressions, and auditory cues. Despite the impressive advances of previous attention-based works, the inherent temporal asynchrony and modality heterogeneity remain open challenges in multimodal sequence fusion and cause performance bottlenecks. To tackle these issues, we propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA) to refine multimodal features and leverage the complementarity across distinct modalities. On the one hand, MEA introduces a predictive self-attention module to capture reliable context dynamics within modalities and reinforce unique features over the modality-exclusive spaces. On the other hand, a hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities over the modality-agnostic space. Meanwhile, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Finally, we propose a decoupled graph fusion mechanism to enhance knowledge exchange across heterogeneous modalities and learn robust multimodal representations for downstream tasks. Extensive experiments are conducted on three multimodal datasets with asynchronous sequences. Systematic analyses show the necessity of our approach.
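To make the idea of attending across asynchronous sequences concrete, the following is a minimal sketch of generic scaled dot-product cross-modal attention in plain Python. It is not the paper's HCA module: the function names, the toy feature vectors, and the two-dimensional feature size are all assumptions chosen for illustration. The point it shows is that the two sequences may have different lengths (here 3 language steps and 5 video steps), because each query element attends over every element of the other modality rather than relying on a frame-by-frame alignment.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Generic scaled dot-product attention from one modality to another.

    queries: list of Tq vectors (target modality), each of dimension d
    keys:    list of Tk vectors (source modality), each of dimension d
    values:  list of Tk vectors (source modality), each of dimension dv
    Returns (outputs, weights): Tq output vectors and a Tq x Tk weight matrix.
    """
    scale = math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query element to every source element.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of source values: one fused vector per query element.
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values)))
             for c in range(value_dim)]
        )
    return outputs, weights


# Hypothetical toy features: 3 language steps and 5 video steps (unaligned).
language = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.5, 0.1], [0.9, 0.0], [0.0, 0.8], [0.2, 0.2], [1.0, 1.0]]

fused, attn = cross_modal_attention(language, video, video)
```

Here `fused` holds one video-informed vector per language step, and each row of `attn` is a distribution over the five video steps, which is the kind of matrix the attention visualizations in the figures depict.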

Paper Structure (37 sections, 16 equations, 9 figures, 5 tables)

Introduction
Related Work
Multimodal Video Computing
Multimodal Sequence Fusion
Multimodal Representation Learning
Methodology
Model Overview
Uni-modal Extractor
Predictive Self-Attention Module
Hierarchical Cross-Modal Attention Module
Decoupled Representation Learning
Modality-Exclusive and -Agnostic Representations
Disparity Constraint
Double-Discriminator Adversarial Strategy
Decoupled Graph Fusion Mechanism
...and 22 more sections

Figures (9)

Figure 1: The overall architecture of the proposed Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA). "PSA" represents a predictive self-attention module. "HCA" represents a hierarchical cross-modal attention module.
Figure 2: (a) The overall structure of the Predictive Self-Attention (PSA) modules. We provide a pipeline of two-layer PSA modules from three modalities to illustrate how the predictive attention map and weighted attention layer work. (b) The overall structure of a Modality Reinforcement Unit (MRU) in the HCA module. (c) The overall structure of the Hierarchical Cross-modal Attention (HCA) modules.
Figure 3: We show the attention matrix activations from (a) the vanilla self-attention (Vaswani et al., 2017) and (b) the proposed PSA module in the language modality. Compared to vanilla self-attention, our PSA module captures more meaningful attention correlations across elements within the modality.
Figure 4: We show the cross-modal attention matrix activations of (a) the proposed HCA module and (b) the SOTA method DMD (Li et al., 2023) on the MOSEI dataset. The spoken words closely related to the expression of human emotions are marked in red. Compared to DMD, our model learns more reliable element correlations between different modalities. For example, stronger attention weights are focused on the intersection regions of cross-modal elements on asynchronous sequences between spoken words ("conclusions") and video frames ("gnashing with grimace"), which usually suggest salient emotion clues.
Figure 5: We randomly select 200 samples from the testing set of each of the three datasets to visualize modality-agnostic and -exclusive representations. $\alpha = 0, \beta = 0$ denotes training without the disparity and adversarial constraints; nonzero values denote training with them. The red, orange, and green colours correspond to the agnostic parts. The pink, yellow, and blue colours correspond to the exclusive parts.
...and 4 more figures
