Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI Track

Jiawen Huang; Chenxi Huang; Zhuofan Wen; Hailiang Yao; Shun Chen; Longjiang Yang; Cong Yu; Fengyu Zhang; Ran Liu; Bin Liu

Abstract

We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. The task is to predict six continuous emotion dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy. In a systematic exploration of pretrained high-level multimodal features, we found that, in this feature setting, direct feature concatenation outperformed the more complex fusion strategies we tested. This finding motivated an approach built on three core principles: (i) preserving modality-specific attributes through feature-level concatenation; (ii) improving training stability and metric alignment via multi-objective optimization; and (iii) enriching acoustic representations with a VAD-inspired latent prior. Our final framework integrates concatenation-based multimodal fusion, a shared six-dimensional regression head, multi-objective optimization combining MSE, Pearson-correlation, and auxiliary branch supervision, an exponential moving average (EMA) of the parameters for stabilization, and a VAD-inspired latent prior for the acoustic branch. On the official validation set, the proposed scheme achieved our best mean Pearson Correlation Coefficient of 0.478567.
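To make the multi-objective training signal concrete, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It combines an MSE term, a Pearson-correlation term (one minus the mean correlation over the six dimensions), and an auxiliary term on per-branch predictions, and adds an EMA parameter update. The function names, the loss weights `w_mse`, `w_pcc`, and `w_aux`, the use of MSE for the auxiliary supervision, and the decay value are all hypothetical; the abstract does not specify them.

```python
import numpy as np

def pearson_per_dim(pred, target, eps=1e-8):
    """Pearson correlation per emotion dimension; inputs are (batch, 6)."""
    pred_c = pred - pred.mean(axis=0, keepdims=True)
    target_c = target - target.mean(axis=0, keepdims=True)
    cov = (pred_c * target_c).sum(axis=0)
    denom = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0)) + eps
    return cov / denom

def multi_objective_loss(fused_pred, branch_preds, target,
                         w_mse=1.0, w_pcc=1.0, w_aux=0.5):
    """Weighted sum of MSE, (1 - mean Pearson), and auxiliary branch MSE.

    The weights are illustrative assumptions, not values from the paper.
    """
    mse = np.mean((fused_pred - target) ** 2)
    pcc_loss = 1.0 - pearson_per_dim(fused_pred, target).mean()
    if branch_preds:
        # Assumed form of auxiliary supervision: plain MSE on each branch output.
        aux = float(np.mean([np.mean((p - target) ** 2) for p in branch_preds]))
    else:
        aux = 0.0
    return w_mse * mse + w_pcc * pcc_loss + w_aux * aux

def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average over a dict of parameter arrays."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}
```

The Pearson term ties the objective to the evaluation metric, while the MSE term keeps predictions on the label scale, since correlation alone is invariant to affine rescaling of the outputs.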

Paper Structure (15 sections, 11 equations, 2 figures, 4 tables)

Introduction
Related Work
Method
Problem Formulation
Overall Architecture
Feature-Concat Multimodal Regression
VAD-Aware Audio Modeling
Multi-Objective Optimization
Experiments
Dataset and Evaluation Metric
Implementation Details
Main Results
Single-Modal Analysis
Ablation Study
Conclusion

Figures (2)

Figure 1: Architecture of the proposed multimodal framework. The model employs three modality-specific branches with auxiliary supervision, VAD-aware audio fusion, feature concatenation, and a shared six-dimensional regression head for emotional mimicry intensity estimation.
Figure 2: Overview of the VAD-aware audio representation learning module. After audio linear projection, the acoustic feature is split into a main audio stream and a VAD prediction branch. The VAD branch produces a 3D latent VAD representation, which is combined with the main audio stream to construct the audio embedding. This embedding is then used for multimodal fusion and auxiliary loss, while the latent VAD representation is regularized to improve stability and prevent extreme deviations.
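
The data flow described in the two captions can be sketched as a forward pass in NumPy. This is a rough illustration under stated assumptions rather than the authors' module: the class and function names, the layer sizes, the ReLU and tanh nonlinearities, combining the two streams by concatenation followed by a linear map, and the mean-squared penalty used as the latent regularizer are all hypothetical choices that the captions do not specify.

```python
import numpy as np

class VADAwareAudio:
    """Toy forward pass: projection -> (main stream, 3-D latent) -> embedding."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02  # assumed initialization scale
        self.W_proj = rng.normal(0.0, scale, (in_dim, hid_dim))
        self.W_main = rng.normal(0.0, scale, (hid_dim, hid_dim))
        self.W_vad = rng.normal(0.0, scale, (hid_dim, 3))
        self.W_out = rng.normal(0.0, scale, (hid_dim + 3, hid_dim))

    def forward(self, audio_feat):
        # Audio linear projection (ReLU is an assumption).
        h = np.maximum(audio_feat @ self.W_proj, 0.0)
        main = h @ self.W_main            # main audio stream
        vad = np.tanh(h @ self.W_vad)     # 3-D latent, bounded by tanh (assumed)
        # Assumed combination: concatenate both streams, then a linear map.
        emb = np.concatenate([main, vad], axis=-1) @ self.W_out
        # Assumed regularizer: penalize large latent values.
        vad_reg = float(np.mean(vad ** 2))
        return emb, vad, vad_reg

def fuse_and_regress(branch_embs, W_head):
    """Concatenate branch embeddings and apply a shared linear head (6 outputs)."""
    fused = np.concatenate(branch_embs, axis=-1)
    return fused @ W_head
```

In this sketch the audio embedding would serve two roles, matching the caption: it is one of the inputs to `fuse_and_regress`, and it could also feed a separate auxiliary head, while `vad_reg` would be added to the training loss with a small weight.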
