Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition

Tong Shi, Xuri Ge, Joemon M. Jose, Nicolas Pugeault, Paul Henderson

TL;DR

The paper tackles Audio-Visual Emotion Recognition (AVER) by addressing two core gaps: capturing fine-grained intra-modal facial details and effectively leveraging inter-modal correlations. It introduces DE-III, a detail-enhanced framework that augments video with explicit optical-flow texture information, uses a pair-wise O-V attention fusion module to integrate frame and flow features, and employs an inter-modal feature enhancement module for attentive cross-modal fusion. The model is trained with three prediction heads using losses $L_V$, $L_A$, and $L_F$, plus $L_{CCC}$ for continuous labels. The approach combines Conformer-based encoders for both audio and video streams, a dedicated video O-V fusion stage, and residual inter-modal connections, achieving state-of-the-art results on CREMA-D, MSP-IMPROV, and RAVDESS. Qualitative analyses further reveal adaptive inter-modal attention patterns, supporting the model’s ability to selectively fuse informative cues across modalities and time. Overall, DE-III improves AVER on these three widely used benchmarks by enhancing visual texture representation and refining cross-modal interactions.
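The summary above names $L_{CCC}$ for continuous labels without giving its form. As a minimal sketch, assuming the common convention that the loss is one minus the concordance correlation coefficient (the paper's exact formulation may differ), the quantity can be computed as follows. The function names `ccc` and `ccc_loss` are illustrative and not taken from the paper.

```python
def ccc(pred, target):
    """Concordance correlation coefficient of two equal-length sequences.

    CCC = 2 * cov(pred, target) / (var(pred) + var(target) + (mean_p - mean_t)^2),
    using population (divide-by-n) variance and covariance.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical: treat as perfect agreement.
        return 1.0
    return 2.0 * cov / denom


def ccc_loss(pred, target):
    """Assumed loss form: 1 - CCC, so perfect agreement gives zero loss."""
    return 1.0 - ccc(pred, target)


if __name__ == "__main__":
    labels = [0.1, 0.5, 0.9]
    shifted = [v + 1.0 for v in labels]  # same shape, constant offset
    print("loss, identical predictions:", ccc_loss(labels, labels))
    print("loss, offset predictions:   ", ccc_loss(shifted, labels))
```

Unlike Pearson correlation, CCC penalizes a constant offset or scale mismatch between predictions and labels, which is why the offset predictions in the usage example incur a non-zero loss even though they are perfectly correlated with the labels.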

Abstract

Capturing complex temporal relationships between video and audio modalities is vital for Audio-Visual Emotion Recognition (AVER). However, existing methods lack attention to local details, such as facial state changes between video frames, which can reduce the discriminability of features and thus lower recognition accuracy. In this paper, we propose a Detail-Enhanced Intra- and Inter-modal Interaction network (DE-III) for AVER, incorporating several novel aspects. We introduce optical flow information to enrich video representations with texture details that better capture facial state changes. A fusion module integrates the optical flow estimation with the corresponding video frames to enhance the representation of facial texture variations. We also design attentive intra- and inter-modal feature enhancement modules to further improve the richness and discriminability of video and audio representations. A detailed quantitative evaluation shows that our proposed model outperforms all existing methods on three benchmark datasets for both concrete and continuous emotion recognition. To encourage further research and ensure replicability, we will release our full code upon acceptance.

Paper Structure (20 sections, 3 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of our proposed method DE-III. Given video frames $v_i$ and audio fragments $a_i$, we extract features and pass these through separate Conformer encoders. We introduce explicit information about facial motions -- captured by optical flow $o_i$ -- to enhance video feature representations, with a new pair-wise O-V attention fusion module that effectively integrates the information from optical flow and video frames. We propose an inter-modal feature enhancement module (large boxes near top) to attentively fuse the associated audio and video representations in both directions, i.e. audio-to-video and video-to-audio. During training, the final emotion predictions are calculated independently from three sets of features: the video features albeit with audio information fused (i.e. without the model components in the chequered box); the converse using the audio features; and finally using both sets of features after a further fusion stage. During inference, we use the prediction head that performed best on validation data.
  • Figure 2: Heatmaps showing inter-modality attention weights calculated by IFE-Audio (left) and IFE-Video (right), for an example sequence with emotion 'angry'. The horizontal axis corresponds to time-points in one modality, which is fusing in information from the other modality on the vertical axis. Brighter colors indicate stronger attention to the time-point on the vertical axis, from the time-point on the horizontal axis.
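The bidirectional fusion described in the Figure 1 caption, and the attention maps visualized in Figure 2, can be illustrated with a small sketch. It assumes plain scaled dot-product attention with a residual connection and omits the learned projections, multiple heads, and normalization that the actual inter-modal feature enhancement module presumably contains. The function name `cross_attend` and the toy feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Enhance `queries` (T_q x d) with information from `context` (T_c x d).

    Returns (enhanced, weights), where weights[i][j] is how strongly query
    time-point i attends to context time-point j. Each row of weights sums to 1.
    """
    d = len(queries[0])
    scale = math.sqrt(d)
    enhanced, weights = [], []
    for q in queries:
        # Similarity of this query time-point to every context time-point.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in context]
        w = softmax(scores)
        # Attention-weighted sum of context features.
        attended = [sum(wj * c[i] for wj, c in zip(w, context)) for i in range(d)]
        # Residual connection: keep the original feature and add the fused one.
        enhanced.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return enhanced, weights


if __name__ == "__main__":
    # Toy features: 3 video time-points and 2 audio time-points, dimension 2.
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    audio = [[1.0, 0.0], [0.0, 1.0]]
    video_enh, audio_to_video = cross_attend(video, audio)  # video queries audio
    audio_enh, video_to_audio = cross_attend(audio, video)  # audio queries video
    for row in audio_to_video:
        print([round(w, 3) for w in row])
```

In this sketch the weight matrix has one row per time-point of the modality receiving information and one column per time-point of the other modality. A heatmap like those in Figure 2 shows the same kind of matrix, with the receiving modality on the horizontal axis and the attended modality on the vertical axis, so the matrix would be transposed for plotting.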