Interactive Multimodal Fusion with Temporal Modeling

Jun Yu, Yongqi Wang, Lei Wang, Yang Zheng, Shengfan Xu

TL;DR

This work tackles valence-arousal (VA) estimation in unconstrained, real-world settings by fusing visual and audio information in a modular multimodal architecture. A visual branch built on a pre-trained ResNet and two audio branches using VGGish and LogMel features extract per-modality features, and multi-scale Temporal Convolutional Networks model their temporal dynamics. Cross-modal attention then fuses the modalities before a regression head predicts VA. Training proceeds in stages: components are first pre-trained on large datasets, then the model is fine-tuned end-to-end on Aff-Wild2, where it improves over baselines on six validation folds. The method advances VA estimation in-the-wild by combining complementary cues across modalities, temporal modeling, and attention-based fusion, with potential impact on affective computing applications in HCI and beyond.

Abstract

This paper presents our method for the estimation of valence-arousal (VA) in the 8th Affective Behavior Analysis in-the-Wild (ABAW) competition. Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract spatial features from facial images. The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals. These features undergo temporal modeling using Temporal Convolutional Networks (TCNs). We then apply cross-modal attention mechanisms, where visual features interact with audio features through query-key-value attention structures. Finally, the features are concatenated and passed through a regression layer to predict valence and arousal. Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
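
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes visual frames act as queries and audio frames as keys and values (the abstract does not say which modality plays which role), it omits the learned query/key/value projections and the TCN stage, and the `tanh` squashing of the output is an added assumption. All function names, feature values, and weights are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: every query frame attends over all key/value frames.

    Assumption: queries come from one modality (here visual), keys and values
    from the other (here audio). Learned projections are left out for brevity.
    """
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended

def regress_va(features, weight, bias):
    """Linear head mapping each fused frame vector to (valence, arousal).

    Assumption: tanh keeps predictions in [-1, 1], the range of VA labels.
    """
    return [
        [math.tanh(sum(w * x for w, x in zip(row, f)) + b) for row, b in zip(weight, bias)]
        for f in features
    ]

# Toy per-frame features (3 frames, 2 dimensions each); values are made up.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Visual frames query the audio frames, then both views are concatenated per frame.
attended = cross_modal_attention(visual, audio, audio)
fused = [v + a for v, a in zip(visual, attended)]

# Hypothetical regression weights: 2 outputs (valence, arousal) x 4 fused dimensions.
weight = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.1, -0.1, 0.2]]
bias = [0.0, 0.0]
va = regress_va(fused, weight, bias)
```

Each row of `va` holds one frame's valence and arousal estimate. In a full model the inputs would be the TCN outputs of each branch rather than raw toy vectors, and the attention would typically be multi-headed with learned projections.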

Paper Structure

This paper contains 14 sections, 16 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Our proposed framework for VA estimation.