Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition
Tong Shi, Xuri Ge, Joemon M. Jose, Nicolas Pugeault, Paul Henderson
TL;DR
The paper tackles Audio-Visual Emotion Recognition (AVER) by addressing two core gaps: capturing fine-grained intra-modal facial details and effectively leveraging inter-modal correlations. It introduces DE-III, a detail-enhanced framework that augments video with explicit optical-flow texture information, uses a pairwise OV attention fusion to integrate frame and flow features, and employs an inter-modal feature enhancement module for attentive cross-modal fusion. The model is trained with three prediction heads using the losses $L_V$, $L_A$, and $L_F$, with $L_{CCC}$ used for continuous labels. The approach combines Conformer-based encoders for both audio and video streams, a dedicated video OV fusion stage, and residual inter-modal connections, achieving state-of-the-art results on CREMA-D, MSP-IMPROV, and RAVDESS. Qualitative analyses further reveal adaptive inter-modal attention patterns, supporting the model’s ability to selectively fuse informative cues across modalities and time. Overall, DE-III advances AVER by enhancing visual texture representation and refining cross-modal interactions; the authors state that the full code will be released upon acceptance to support replicability.
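The loss used for continuous labels is presumably based on the concordance correlation coefficient (CCC), which is commonly turned into a loss as $1 - \mathrm{CCC}$. The sketch below shows that standard definition in plain Python; the function names `ccc` and `ccc_loss` are hypothetical, and how the paper weights this term against $L_V$, $L_A$, and $L_F$ is not stated here, so no combination is shown.

```python
from typing import Sequence


def ccc(pred: Sequence[float], target: Sequence[float]) -> float:
    """Concordance correlation coefficient between two equal-length sequences.

    Standard definition (population statistics):
        CCC = 2 * cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t))^2)
    """
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("pred and target must be non-empty and equal in length")

    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n

    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical: treat as perfect agreement.
        return 1.0
    return 2.0 * cov / denom


def ccc_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Assumed loss form: 1 - CCC, so perfect agreement gives a loss of 0."""
    return 1.0 - ccc(pred, target)
```

Unlike plain Pearson correlation, CCC also penalizes a shift in mean or scale between predictions and labels, which is why a prediction that is perfectly correlated but offset still incurs a non-zero loss.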
Abstract
Capturing complex temporal relationships between video and audio modalities is vital for Audio-Visual Emotion Recognition (AVER). However, existing methods pay little attention to local details, such as facial state changes between video frames, which can reduce the discriminability of features and thus lower recognition accuracy. In this paper, we propose a Detail-Enhanced Intra- and Inter-modal Interaction network (DE-III) for AVER, incorporating several novel aspects. We introduce optical flow information to enrich video representations with texture details that better capture facial state changes. A fusion module integrates the optical flow estimation with the corresponding video frames to enhance the representation of facial texture variations. We also design attentive intra- and inter-modal feature enhancement modules to further improve the richness and discriminability of video and audio representations. A detailed quantitative evaluation shows that our proposed model outperforms all existing methods on three benchmark datasets for both concrete and continuous emotion recognition. To encourage further research and ensure replicability, we will release our full code upon acceptance.
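To make the idea of fusing each video frame with its optical flow estimate concrete, the following is a minimal sketch of one possible attention-weighted pairwise fusion. It is not the DE-III fusion module: the query vector, the scaled dot-product scoring, and the names `fuse_pair` and `fuse_sequence` are all assumptions introduced for illustration, and the per-frame features are assumed to be plain lists of floats of equal dimension.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_pair(
    frame_feat: Sequence[float],
    flow_feat: Sequence[float],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse one frame feature with its flow feature (hypothetical scheme).

    Each feature is scored against a query by scaled dot product; the two
    scores are normalized with softmax and used as mixing weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([
        dot(query, frame_feat) / scale,
        dot(query, flow_feat) / scale,
    ])
    fused = [weights[0] * f + weights[1] * o for f, o in zip(frame_feat, flow_feat)]
    return fused, weights


def fuse_sequence(
    frames: Sequence[Sequence[float]],
    flows: Sequence[Sequence[float]],
    query: Sequence[float],
) -> List[List[float]]:
    """Apply the pairwise fusion independently at every time step."""
    if len(frames) != len(flows):
        raise ValueError("frames and flows must have the same number of time steps")
    return [fuse_pair(f, o, query)[0] for f, o in zip(frames, flows)]
```

In a trained model the query and the feature projections would be learned parameters; here they are fixed inputs so that the mixing behaviour is easy to inspect.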
