Joint Multimodal Transformer for Emotion Recognition in the Wild
Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger
TL;DR
This work tackles dimensional emotion recognition in the wild by proposing a Joint Multimodal Transformer (JMT) that fuses visual and audio cues via key-based cross-attention while introducing a third, joint modality representation. Modality-specific backbones extract intra-modal spatiotemporal features, which are concatenated into a joint representation and processed through a Joint Transformer Module (JTM) to model inter- and intra-modal relationships. The model is trained with a Concordance Correlation Coefficient (CCC) loss $L_c = 1 - \rho_c$, where $\rho_c = \frac{2\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$, $\sigma_{xy}$ is the covariance between predictions $x$ and ground-truth annotations $y$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\mu_x$ and $\mu_y$ are their means. Experiments on Affwild2 (valence/arousal) and BioVid (pain) show that JMT outperforms baselines and achieves state-of-the-art results on BioVid, while delivering notable gains on Affwild2 compared to vanilla fusion methods. The joint representation makes the approach robust to noisy modalities, offering a cost-effective and scalable solution for real-world multimodal emotion recognition (MMER) tasks. The work suggests extending JMT with more modalities and stronger backbones to further improve its expressive power in in-the-wild scenarios.
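As an illustration of the CCC loss mentioned above, here is a minimal NumPy sketch. It assumes the standard CCC definition, with the covariance of predictions and targets in the numerator and population (biased) statistics throughout. The function names `ccc` and `ccc_loss` and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
import numpy as np


def ccc(pred, target, eps=1e-8):
    """Concordance Correlation Coefficient between two 1-D sequences.

    Assumption: population (biased) variance and covariance are used;
    `eps` is a hypothetical stabilizer to avoid division by zero.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov_xy = ((pred - mu_x) * (target - mu_y)).mean()
    return 2.0 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2 + eps)


def ccc_loss(pred, target):
    """Loss L_c = 1 - CCC: 0 for perfect agreement, up to 2 for perfect disagreement."""
    return 1.0 - ccc(pred, target)
```

Unlike a plain correlation, this quantity also penalizes a constant offset between predictions and annotations: for example, `ccc_loss([0, 1, 2], [-1, 0, 1])` is about 3/7 even though the two sequences are perfectly correlated.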
Abstract
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the BioVid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems using the proposed fusion outperform relevant baselines and state-of-the-art methods.
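The fusion idea described above can be sketched in a few lines of NumPy. This is a hedged illustration only: it assumes single-head scaled dot-product cross-attention, two modalities with the same sequence length and feature size, a linear projection of the concatenated features as the joint representation, and random (untrained) weights. The names `cross_attention` and `joint_fusion` and the all-pairs wiring between the three streams are assumptions, not the authors' exact architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, key_feats, w_q, w_k, w_v):
    """Queries come from one stream; keys and values come from another."""
    q = query_feats @ w_q
    k = key_feats @ w_k
    v = key_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def joint_fusion(visual, audio, rng):
    """Fuse two (T, d) feature sequences plus a joint third stream.

    Assumption: every stream attends to each of the other two,
    giving six attended outputs that are concatenated to (T, 6 * d).
    """
    t, d = visual.shape
    # Joint representation: concatenate modalities, project back to d dims.
    w_joint = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    joint = np.concatenate([visual, audio], axis=-1) @ w_joint

    streams = {"visual": visual, "audio": audio, "joint": joint}
    outputs = []
    for q_name, q_feats in streams.items():
        for k_name, k_feats in streams.items():
            if q_name == k_name:
                continue  # cross-attention only, no self-attention here
            w_q, w_k, w_v = (
                rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)
            )
            outputs.append(cross_attention(q_feats, k_feats, w_q, w_k, w_v))
    return np.concatenate(outputs, axis=-1)
```

In a trained model the projection matrices would be learned parameters and the fused output would feed a regression head for valence/arousal or pain level; here they are random so that only the data flow is shown.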
