Joint Multimodal Transformer for Emotion Recognition in the Wild

Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger

TL;DR

This work tackles dimensional emotion recognition in the wild by proposing a Joint Multimodal Transformer (JMT) that fuses visual and audio cues via key-based cross-attention while introducing a third, joint modality representation. Modality-specific backbones extract intra-modal spatiotemporal features, which are concatenated into a joint representation and processed by a Joint Transformer Module (JTM) to model inter- and intra-modal relationships. The model is trained with a Concordance Correlation Coefficient (CCC) loss, $L_c = 1 - \rho_c = 1 - \frac{2\rho_{xy}^2}{\rho_x^2 + \rho_y^2 + (\mu_x - \mu_y)^2}$. Experiments on Affwild2 (valence/arousal) and Biovid (pain) show that JMT outperforms baselines and achieves state-of-the-art results on Biovid, while delivering notable gains on Affwild2 over vanilla fusion methods. The joint representation gives the approach robustness to noisy modalities, offering a cost-effective and scalable solution for real-world multimodal emotion recognition (MMER) tasks. The work suggests extending JMT with more modalities and stronger backbones to further enhance its expressive power in in-the-wild scenarios.
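The CCC loss above can be made concrete with a small numeric sketch. It assumes the usual reading of the CCC formula, in which $\rho_{xy}^2$ is the covariance between predictions and labels, $\rho_x^2$ and $\rho_y^2$ are their variances, and $\mu_x$, $\mu_y$ are their means. The function names `ccc` and `ccc_loss` are illustrative and not taken from the paper, and the sketch uses population (biased) statistics on plain Python lists rather than the batched tensors a training loop would use.

```python
from statistics import fmean


def ccc(preds, labels):
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(preds) != len(labels) or len(preds) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mu_x = fmean(preds)
    mu_y = fmean(labels)
    # Population variances and covariance (divide by N, not N - 1).
    var_x = fmean([(x - mu_x) ** 2 for x in preds])
    var_y = fmean([(y - mu_y) ** 2 for y in labels])
    cov_xy = fmean([(x - mu_x) * (y - mu_y) for x, y in zip(preds, labels)])
    denom = var_x + var_y + (mu_x - mu_y) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical: treat as perfect agreement.
        return 1.0
    return 2.0 * cov_xy / denom


def ccc_loss(preds, labels):
    """Loss L_c = 1 - CCC: 0 for perfect agreement, up to 2 for perfect disagreement."""
    return 1.0 - ccc(preds, labels)
```

Unlike plain Pearson correlation, the $(\mu_x - \mu_y)^2$ term penalizes a constant offset: in this sketch, predictions `[1, 2, 3]` against labels `[2, 3, 4]` are perfectly correlated yet give a CCC of 4/7 rather than 1.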

Abstract

Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion outperform relevant baselines and state-of-the-art methods.

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: (Top) An illustration of the vanilla multimodal transformer fusion architecture in the case of two input sources, A and B. (Bottom) Our proposed JMT fusion (in red) relies on joint multimodal representations.
  • Figure 2: An overview of the proposed joint multimodal transformer model for audio-visual (A-V) fusion. The audio and visual modalities are cross-attended using transformer blocks. The JMT block also takes in the joint representation (shown with red arrows). The cross-attended output features are concatenated, and a fully connected (FC) layer predicts valence/arousal.
  • Figure 3: Illustration of the proposed joint multimodal transformer architecture used for the Biovid pain estimation task. The blue branch shows the visual backbone, and the yellow branch is the physiological backbone. The joint representation is shown with a red block. The three feature vectors are fed into the joint transformer block.
  • Figure 4: Visualization of attention weights for visual and physiological modalities on the Biovid heat pain database. The facial frames are sampled at 1400 ms intervals.
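
The data flow described in the Figure 2 caption (two modality streams, a joint representation, cross-attention, then concatenation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the joint representation is formed here by concatenating the two token sequences along the token axis so that no learned projection is needed (the paper may concatenate along the feature axis instead), the attention is plain scaled dot-product attention with no learned query/key/value weights, and the final FC prediction head is omitted. The names `cross_attention` and `joint_fusion` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (assumption)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the value tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def joint_fusion(audio, visual):
    """Fuse two token sequences of shape [T][d] through a shared joint sequence."""
    if len(audio) != len(visual):
        raise ValueError("this sketch assumes both streams have the same length T")
    # Joint representation: token-axis concatenation (assumption of this sketch).
    joint = audio + visual
    # Each modality queries the joint sequence, which supplies keys and values.
    audio_att = cross_attention(audio, joint, joint)
    visual_att = cross_attention(visual, joint, joint)
    # Concatenate the attended features per time step; an FC head would follow.
    return [a + v for a, v in zip(audio_att, visual_att)]
```

With `T` time steps and feature size `d` per modality, `joint_fusion` returns `T` vectors of size `2 * d`. In a trained model, learned projections and multiple attention heads would replace the parameter-free attention used here.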