Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision

Elena Ryumina, Maxim Markitantov, Dmitry Ryumin, Heysem Kaya, Alexey Karpov

TL;DR

The paper tackles zero-shot compound emotion recognition by deploying an audio-visual pipeline built from basic emotion recognizers (static visual, dynamic visual, and audio) and two rule-based decision mechanisms to predict seven compound expressions without task-specific training data. It introduces a two-stage fusion scheme, using Dirichlet- and hierarchical-weighting of emotion-probability distributions, followed by Rule 1 and Rule 2 to derive final predictions. The approach is evaluated through multi-corpus training and cross-corpus validation, achieving a reported F1 score of 22.01% on the C-EXPR-DB test subset, and shows that audio-visual fusion can provide measurable gains while weight configurations reveal complementary model contributions. This framework offers a promising basis for annotating audio-visual data containing basic and compound emotions, with potential application as an efficient annotation tool in real-world scenarios.

Abstract

This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition. We propose a novel audio-visual method for compound expression recognition. Our method relies on emotion recognition models that fuse modalities at the emotion probability level, while decisions regarding the prediction of compound expressions are based on predefined rules. Notably, our method does not use any training data specific to the target task, so the problem is treated as a zero-shot classification task. The method is evaluated in multi-corpus training and cross-corpus validation setups. The proposed method achieves an F1-score of 22.01% on the C-EXPR-DB test subset. Our findings from the challenge demonstrate that the proposed method can potentially form a basis for developing intelligent tools for annotating audio-visual data in the context of basic and compound human emotions.
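
As a rough illustration of fusion "at the emotion probability level" followed by a rule-based decision, the sketch below combines the probability distributions of three basic-emotion models with per-model, per-emotion weights and then scores each compound expression by summing the fused probabilities of its two constituent emotions. It is a minimal sketch under stated assumptions, not the authors' implementation: the emotion order, the compound-to-basic mapping, the function names, and the pair-sum rule itself are all hypothetical, and sampling the weights from a Dirichlet distribution is only one plausible reading of "Dirichlet-based weighting".

```python
import numpy as np

# Assumed order of basic emotions (the summary does not specify one).
EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"]

# Hypothetical mapping from each compound expression to two basic emotions.
COMPOUNDS = {
    "fearfully surprised": ("fear", "surprise"),
    "happily surprised": ("happy", "surprise"),
    "sadly surprised": ("sad", "surprise"),
    "disgustedly surprised": ("disgust", "surprise"),
    "angrily surprised": ("anger", "surprise"),
    "sadly fearful": ("sad", "fear"),
    "sadly angry": ("sad", "anger"),
}


def sample_dirichlet_weights(n_models, n_emotions, seed=0):
    """Draw one weight vector over models for every emotion.

    Returns an array of shape (n_models, n_emotions) whose columns sum to 1.
    """
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(n_models), size=n_emotions).T


def fuse(pds, weights):
    """Weighted late fusion of probability distributions.

    pds and weights both have shape (n_models, n_emotions).
    The result is renormalized so that it is again a distribution.
    """
    fused = (np.asarray(weights) * np.asarray(pds)).sum(axis=0)
    return fused / fused.sum()


def pair_sum_rule(fused):
    """Illustrative rule: pick the compound whose two emotions score highest."""
    idx = {name: i for i, name in enumerate(EMOTIONS)}
    scores = {c: fused[idx[a]] + fused[idx[b]] for c, (a, b) in COMPOUNDS.items()}
    return max(scores, key=scores.get)


# Toy distributions for the static visual, dynamic visual, and audio models.
pds = np.array([
    [0.05, 0.05, 0.05, 0.40, 0.35, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.35, 0.35, 0.05, 0.05],
    [0.10, 0.05, 0.05, 0.30, 0.40, 0.05, 0.05],
])
uniform = np.full(pds.shape, 1.0 / pds.shape[0])
prediction = pair_sum_rule(fuse(pds, uniform))
```

In a setup like the one described, candidate weight matrices such as those returned by `sample_dirichlet_weights` would presumably be compared on validation data and the best one kept; that selection step is omitted here.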

Paper Structure (11 sections, 6 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Pipeline of the proposed audio-visual method. PD refers to probability distribution.
  • Figure 2: Weights for different modality fusions. VS, VD, and A refer to the static visual, dynamic visual, and acoustic models, respectively. Seven per-emotion weights are used for Dirichlet-based weighting, and Mo denotes the model weights used for hierarchical weighting.
  • Figure 3: An example of prediction using video from the C-EXPR-DB corpus.