Hi-EF: Benchmarking Emotion Forecasting in Human-interaction

Haoran Wang; Xinji Mai; Zeng Tao; Junxiong Lin; Xuan Tong; Ivy Pan; Shaoqi Yan; Yan Wang; Shuyong Gao

TL;DR

This work reframes Affective Forecasting as Emotion Forecasting (EF) in two-party interactions and introduces the Hi-EF dataset with Multilayered-Contextual Interaction Samples (MCIS) to predict a partner's future emotion from short-term context and current states. A three-function EF paradigm—context fusion, current emotion recognition, and future emotion forecasting—underpins the dataset design, labeling, and baseline modeling, which together demonstrate the task's feasibility. Through comprehensive experiments across multimodal encoders and fusion strategies, the study shows the value of combining contextual information with intra- and inter-video fusion, highlighting practical implications for emotion-aware systems and interactive agents. The Hi-EF resource and EF framework lay groundwork for further research in affective computing beyond current emotion recognition toward predictive, interaction-aware emotion modeling.

Abstract

Affective Forecasting is a psychology task that involves predicting an individual's future emotional responses. It is often hampered by reliance on external factors, which leads to inaccuracies, and it typically remains at the stage of qualitative analysis. To address these challenges, we narrow the scope of Affective Forecasting by introducing the concept of Human-interaction-based Emotion Forecasting (EF). This task is set within the context of a two-party interaction, positing that an individual's emotions are significantly influenced by their interaction partner's emotional expressions and informational cues. This dynamic provides a structured perspective for exploring the patterns of emotional change, thereby enhancing the feasibility of emotion forecasting.

Paper Structure (13 sections, 4 figures, 3 tables)

Introduction
Significance of EF Task
Relevant Datasets of Emotion Forecasting: Emotion Recognition Datasets
The Hi-EF Dataset
Paradigm of Emotion Forecasting
Design of Multilayered-Contextual Interaction Sample (MCIS)
Dataset Construction Procedure
Dataset Statistics
Experiment
Experiment Setup
Experimental Results
Conclusions and Discussion
Acknowledgements

Figures (4)

Figure 1: Comparing EF task with ER task.
Figure 2: Overview of MCIS. To forecast emotion during an interaction, we design a new data format, MCIS, that carries multilayered contextual information: contextual information (clips I and II), the current emotional state of Party A (clip III), and the future emotion of Party B (clip IV). To better evaluate Party A’s emotional state and more accurately predict Party B’s emotions, we provide three modalities for MCIS: video, audio, and text.
Figure 3: Overview of the construction procedure for the Hi-EF dataset. a) Shows the process of generating candidate MCIS. b) Illustrates anomaly detection and the rule- and multi-facet-assisted reliable annotation of candidate MCIS. c) Details the annotation process. All annotation information other than the emotion labels is intended to support more accurate emotion labeling. We will provide all annotation details within the dataset.
Figure 4: Overview of our model architecture. The first three clips in MCIS are respectively input into the intra-video fusion block, where the obtained features are then fed into the inter-video fusion block for prediction. The final prediction result is compared with the emotion label of clip IV in MCIS to calculate the loss. In the intra-video fusion block, video frames are divided into facial, posture, and scene parts, which are separately input into the vision encoder. These features are then fused to form the video modality information. Meanwhile, audio and text are input into the audio encoder and text encoder, respectively. After obtaining the features from the three modalities, they are fused to produce the feature representation of a single video. The inter-video fusion block primarily fuses the features of the three clips.
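The data layout in Figure 2 and the two-stage fusion in Figure 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Clip`, `MCIS`, `intra_video_fusion`, `inter_video_fusion`, and `forecast` are hypothetical, concatenation and mean pooling stand in for the unspecified fusion strategies, the facial, posture, and scene features are collapsed into a single video vector, and a fixed linear scorer replaces the trained prediction head (no loss against the clip IV label is computed).

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Clip:
    """Per-modality feature vectors for one clip (stand-ins for encoder outputs)."""
    video: Vector
    audio: Vector
    text: Vector


@dataclass
class MCIS:
    """Hypothetical container mirroring the four-clip layout of Figure 2."""
    context: List[Clip]  # clips I and II: contextual information
    current: Clip        # clip III: current emotional state of Party A
    future_label: int    # emotion label of Party B in clip IV (the target)


def intra_video_fusion(clip: Clip) -> Vector:
    # Assumption: fuse the three modalities of one clip by concatenation.
    return clip.video + clip.audio + clip.text


def inter_video_fusion(clip_features: List[Vector]) -> Vector:
    # Assumption: fuse the per-clip features by element-wise mean pooling.
    n = len(clip_features)
    return [sum(values) / n for values in zip(*clip_features)]


def forecast(sample: MCIS, weights: List[Vector]) -> int:
    """Score each emotion class with a linear layer and return the argmax."""
    clips = sample.context + [sample.current]
    fused = inter_video_fusion([intra_video_fusion(c) for c in clips])
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: one-dimensional features per modality and two emotion classes.
sample = MCIS(
    context=[Clip([1.0], [0.0], [0.0]), Clip([1.0], [0.0], [0.0])],
    current=Clip([1.0], [6.0], [0.0]),
    future_label=1,
)
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical, untrained
predicted = forecast(sample, weights)
```

Only clips I to III are passed to the model; the clip IV label is kept aside as the target, which is what separates forecasting from recognition in this setup.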
