Semantic Matters: Multimodal Features for Affective Analysis

Tobias Hallmen, Robin-Nico Kampa, Fabian Deuser, Norbert Oswald, Elisabeth André

TL;DR

The paper addresses the Emotional Mimicry Intensity (EMI) estimation and Behavioural Ambivalence/Hesitancy (BAH) recognition challenges of the 8th ABAW workshop and competition with a multimodal framework that fuses vision (ViT-Huge), audio (Wav2Vec 2.0 with VAD), and text (Whisper+GTE) to exploit semantic content for affective analysis. Each modality is modeled temporally with LSTMs and the results are fused by a two-layer MLP. For EMI, the model is trained on 12-second audio chunks and 128-token text representations; for BAH, it uses a context-aware, convolution-like approach with 20-second text chunks. The approach achieves $\rho_{\text{TEST}} = 0.706$ for EMI and $F1_{\text{TEST}} = 0.702$ for BAH, securing first and second place, respectively. The findings underscore the value of semantically rich text in multimodal affective prediction and point to future work on fusion strategies and semantic-aware audio representations to further improve in-the-wild performance.

Abstract

In this study, we present our methodology for two tasks: the Emotional Mimicry Intensity (EMI) Estimation Challenge and the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge, both conducted as part of the 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild. We utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract various audio features, capturing both linguistic and paralinguistic information. Our approach incorporates a valence-arousal-dominance (VAD) module derived from Wav2Vec 2.0, a BERT text encoder, and a vision transformer (ViT), with predictions subsequently processed through a long short-term memory (LSTM) architecture or a convolution-like method for temporal modeling. We integrate the textual and visual modalities into our analysis, recognizing that semantic content provides valuable contextual cues and that the meaning of speech often conveys more critical insights than its acoustic counterpart alone. Fusing in the vision modality helps in some cases to interpret the textual modality more precisely. This combined approach results in significant performance improvements, achieving $\rho_{\text{TEST}} = 0.706$ on EMI and $F1_{\text{TEST}} = 0.702$ on BAH, securing first place in the EMI challenge and second place in the BAH challenge.
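To make the two reported scores concrete, the following is a minimal sketch of how such metrics can be computed. It assumes that $\rho$ denotes the Pearson correlation coefficient, averaged over the emotion dimensions, and that $F1$ is the standard binary F1 score; the exact averaging used by the challenge organizers is not stated in this section. The function names `pearson_r`, `mean_pearson`, and `f1_score` are illustrative and do not come from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(pred_by_dim, target_by_dim):
    """Average Pearson correlation over dimensions (assumed EMI-style score).

    Both arguments are lists with one sequence per emotion dimension.
    """
    scores = [pearson_r(p, t) for p, t in zip(pred_by_dim, target_by_dim)]
    return sum(scores) / len(scores)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed BAH-style score)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_score([1, 0, 1, 1], [1, 1, 0, 1])` has two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3.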

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Architecture overview of our approach. We first pre-process the cropped face images and transcribe the audio. Afterwards, we process each modality independently and then fuse them in our fusion module.
  • Figure 2: Using chunks from different modalities. The chunk sizes are chosen differently for each modality, based on achieving optimal performance on the validation split. The available chunk sizes and their variations are depicted as colored boxes with arrows in the figure. The data shown are taken from the BAH challenge dataset.
  • Figure 3: Comparison of the different modalities' performance on the validation split. Despite being derived from audio, text shows best performance. Vision is the weakest performing modality.
  • Figure 4: Qualitative example of our predictions with our best model on the BAH task. While the model detects ambivalence and hesitation, it sometimes shows single-frame positives where there are none. On the other hand, later context is not correctly classified.
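
The per-modality chunking described in the Figure 2 caption can be illustrated with a short sketch. This is not the authors' implementation: it assumes fixed-length windows measured in seconds, an optional stride smaller than the window (which gives overlapping, sliding-kernel-like context), and transcript words that carry a start time. The names `chunk_spans` and `words_in_span` are hypothetical.

```python
def chunk_spans(duration, chunk_len, stride=None):
    """Return (start, end) spans in seconds that cover [0, duration].

    With stride == chunk_len the spans tile the recording; with a smaller
    stride they overlap, similar to a kernel sliding over time.
    """
    duration = float(duration)
    chunk_len = float(chunk_len)
    stride = chunk_len if stride is None else float(stride)
    if duration <= 0 or chunk_len <= 0 or stride <= 0:
        raise ValueError("duration, chunk_len and stride must be positive")
    spans = []
    start = 0.0
    while start < duration:
        # The last span is clipped to the end of the recording.
        spans.append((start, min(start + chunk_len, duration)))
        start += stride
    return spans


def words_in_span(timed_words, span):
    """Select words whose start time falls inside the half-open span.

    `timed_words` is a list of (word, start_seconds) pairs, an assumed
    transcript format for this sketch.
    """
    begin, end = span
    return [word for word, t in timed_words if begin <= t < end]
```

Under these assumptions, `chunk_spans(30, 12)` yields three spans, the last one shortened to six seconds, and `chunk_spans(40, 20, stride=10)` yields four overlapping 20-second text windows (the final one clipped to ten seconds).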