Mood as a Contextual Cue for Improved Emotion Inference

Soujanya Narayana, Ibrahim Radwan, Ramanathan Subramanian, Roland Goecke

TL;DR

The paper investigates mood as a long-term contextual cue for time-continuous emotion inference, proposing multimodal fusion of mood and emotion-change ($\Delta$) information to predict frame-level valence. It introduces two- and three-branch networks (M-ValNet and M$\Delta$-ValNet) and integrates sequential spatial-channel attention to improve predictions, validating results on EMMA and AffWild2. Empirical findings show that mood plus $\Delta$ context improves CCC-based valence prediction, with attention modules providing further gains and demonstrating cross-dataset generalisability. The work highlights the importance of incorporating long-term affect as context in affective computing and points to future work involving longer mood sequences and additional modalities for robust emotion inference.
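
Since results are reported in terms of CCC, a minimal sketch of the standard concordance correlation coefficient may help in reading them. This is the textbook population-statistics definition, not code from the paper; the function name `concordance_cc` is an assumption made for illustration.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient (CCC) of two equal-length sequences.

    Textbook definition: 2*cov / (var_pred + var_gold + (mean_pred - mean_gold)^2),
    using population (divide-by-n) statistics. Illustrative only.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical: CCC is undefined.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


gold = [0.1, 0.2, 0.3]
perfect = concordance_cc(gold, gold)                # identical traces
shifted = concordance_cc([0.6, 0.7, 0.8], gold)     # same shape, constant offset
reversed_ = concordance_cc([0.3, 0.2, 0.1], gold)   # opposite trend
```

Unlike Pearson correlation, CCC penalises a constant offset between prediction and annotation, so the `shifted` trace scores well below the `perfect` one even though the two are perfectly correlated.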

Abstract

Psychological studies observe that emotions are rarely expressed in isolation and are typically influenced by the surrounding context. While recent studies effectively harness uni- and multimodal cues for emotion inference, hardly any study has considered the effect of long-term affect, or \emph{mood}, on short-term \emph{emotion} inference. This study (a) proposes time-continuous \emph{valence} prediction from videos, fusing multimodal cues including \emph{mood} and \emph{emotion-change} ($\Delta$) labels, (b) serially integrates spatial and channel attention for improved inference, and (c) demonstrates algorithmic generalisability with experiments on the \emph{EMMA} and \emph{AffWild2} datasets. Empirical results affirm that utilising mood labels is highly beneficial for dynamic valence prediction. Comparing \emph{unimodal} (training only with mood labels) vs \emph{multimodal} (training with mood and $\Delta$ labels) results, inference performance improves for the latter, conveying that both long and short-term contextual cues are critical for time-continuous emotion inference.

Paper Structure (34 sections, 3 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Problem Overview: We perform emotion inference in videos using mood and emotion change ($\Delta$) labels as contextual cues. (1) For each clip (an exemplar clip segment from the AffWild2 dataset \cite{kollias2019expression} is denoted in green), we utilise existing mood labels or derive them from valence annotations. (2) Moreover, we utilise emotion-change ($\Delta$) information in the form of the valence differential between the first and last frames of the video clip. (3) Employing mood and emotion-change labels, we predict the valence rating for the succeeding frame (red).
  • Figure 2: (Left) Exemplar frames in EMMA \cite{katsimerou2016crowdsourcing}, showing an occluded/non-frontal face. (Right) Illustration of an input sample where the raw video is face-cropped, aligned and sub-sampled (top) to generate the visual information sequence from $t_0$ to $t_k$ for contextual emotion inference (middle), based on which valence at $t_{k+1}$ is inferred (bottom).
  • Figure 3: Architecture of the M$\Delta$-ValNet, a three-branch network utilising both mood and $\Delta$ labels for emotion inference. The architecture of ValNet comprises only the bottom branch. Similarly, M-ValNet comprises the top and bottom branches depicted above.
  • Figure 4: (Left) Spatial attention module. (Right) Channel attention module.
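
The sample construction described in the captions of Figures 1 and 2 can be sketched roughly as follows: a window of frames $t_0$ to $t_k$ supplies the context, $\Delta$ is the valence differential between the first and last frames of the window, and the target is the valence at $t_{k+1}$. Several details here are assumptions rather than the authors' method: the sign convention (last minus first), the names `Sample` and `build_samples`, and especially `placeholder_mood`, which is a hypothetical stand-in because the captions do not state how mood is derived from valence annotations. Face cropping, alignment and sub-sampling are omitted.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    context_frames: List[int]  # frame indices t_0 .. t_k
    mood: int                  # clip-level mood label (placeholder rule below)
    delta: float               # valence differential over the context window
    target_valence: float      # valence at t_{k+1}


def placeholder_mood(valence: List[float]) -> int:
    """HYPOTHETICAL rule, not from the paper: sign of the mean valence (+1, 0, -1)."""
    mean_v = sum(valence) / len(valence)
    return (mean_v > 0) - (mean_v < 0)


def build_samples(valence: List[float], k: int) -> List[Sample]:
    """Slide a (k+1)-frame context window over per-frame valence annotations."""
    samples = []
    # Each sample needs k+1 context frames plus one succeeding target frame.
    for start in range(len(valence) - k - 1):
        ctx = valence[start:start + k + 1]
        samples.append(Sample(
            context_frames=list(range(start, start + k + 1)),
            mood=placeholder_mood(ctx),
            delta=ctx[-1] - ctx[0],  # assumed convention: last minus first
            target_valence=valence[start + k + 1],
        ))
    return samples


samples = build_samples([0.0, 0.1, 0.3, 0.2, -0.4], k=2)
```

With five annotated frames and `k=2`, this yields two samples; the second uses frames 1 to 3 as context and the final frame's valence as its target.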