Saliency-guided Emotion Modeling: Predicting Viewer Reactions from Video Stimuli

Akhila Yaragoppa, Siddharth

TL;DR

This work investigates how visual saliency influences viewer emotions in video stimuli by introducing two interpretable features, Saliency Area and Number of Salient Regions, derived from the HD2S saliency model. These features are related to viewers' emotions through OpenFace-derived Facial Action Units and Canonical Correlation Analysis on the MAHNOB-HCI dataset. The analysis reveals that multiple salient regions tend to evoke high valence and low arousal, while a single salient region is associated with low valence and high arousal. It also finds that self-reported emotions often diverge from facial expressions. The study demonstrates that saliency-driven cues can provide a computationally efficient and interpretable alternative for emotion modeling, with implications for content creation and affective computing, while acknowledging limitations in causality, generalizability, and dependence on a single saliency model. Overall, it highlights the potential of saliency-based features to predict and guide audience reactions, complementing traditional self-report and biosensing approaches.
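As an illustration of the Canonical Correlation Analysis step mentioned above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical matrix `X` of per-video saliency features and a hypothetical matrix `Y` of emotion measures (for example valence and arousal, or Facial Action Unit intensities), each with one row per observation and full column rank. The function name `first_canonical_correlation` is made up for this example.

```python
import numpy as np


def first_canonical_correlation(X, Y):
    """Largest canonical correlation between the columns of X (n, p) and Y (n, q).

    Assumes both matrices have full column rank after centring.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    # Centre each column so that inner products behave like covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Orthonormal bases for the two centred column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)

    # The singular values of Qx^T Qy are the canonical correlations.
    singular_values = np.linalg.svd(Qx.T @ Qy, compute_uv=False)

    # Clamp to 1.0 to absorb floating-point overshoot.
    return float(min(singular_values[0], 1.0))
```

A value near 1 means some linear combination of the saliency features tracks some linear combination of the emotion measures closely. A value near 0 means no such linear relationship is visible in the sample.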

Abstract

Understanding the emotional impact of videos is crucial for applications in content creation, advertising, and Human-Computer Interaction (HCI). Traditional affective computing methods rely on self-reported emotions, facial expression analysis, and biosensing data, yet they often overlook the role of visual saliency -- the naturally attention-grabbing regions within a video. In this study, we utilize deep learning to introduce a novel saliency-based approach to emotion prediction by extracting two key features: saliency area and number of salient regions. Using the HD2S saliency model and OpenFace facial action unit analysis, we examine the relationship between video saliency and viewer emotions. Our findings reveal three key insights: (1) Videos with multiple salient regions tend to elicit high-valence, low-arousal emotions, (2) Videos with a single dominant salient region are more likely to induce low-valence, high-arousal responses, and (3) Self-reported emotions often misalign with facial expression-based emotion detection, suggesting limitations in subjective reporting. By leveraging saliency-driven insights, this work provides a computationally efficient and interpretable alternative for emotion modeling, with implications for content creation, personalized media experiences, and affective computing research.
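To make the two features concrete, here is a minimal sketch of how saliency area and number of salient regions could be computed from a single saliency heatmap. The abstract does not give the exact procedure, so everything here is an assumption: the heatmap is taken to be a 2-D array scaled to [0, 1], the `threshold` value of 0.5 is hypothetical, and regions are counted as 4-connected components of the thresholded map. The function name `saliency_features` is illustrative only.

```python
from collections import deque

import numpy as np


def saliency_features(saliency_map, threshold=0.5):
    """Return (saliency_area, num_regions) for one frame.

    saliency_map: 2-D array with values in [0, 1] (assumed scaling).
    threshold: hypothetical cut-off for calling a pixel salient.
    """
    mask = np.asarray(saliency_map) >= threshold
    rows, cols = mask.shape

    # Saliency area: fraction of the frame's pixels that are salient.
    area = float(mask.mean())

    # Number of salient regions: count 4-connected components with a
    # breadth-first flood fill over the binary mask.
    visited = np.zeros_like(mask, dtype=bool)
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or visited[r, c]:
                continue
            regions += 1
            visited[r, c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        queue.append((ny, nx))
    return area, regions
```

Applied frame by frame, a function like this yields two time series per video, which can then be summarized per stimulus before relating them to valence and arousal.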

Paper Structure

This paper contains 13 sections and 9 figures.

Figures (9)

  • Figure 1: Distribution of participants' self-reported emotions in the MAHNOB-HCI Dataset on the Emotion Circumplex Model
  • Figure 2: Features "saliency area" and "number of salient regions" (extracted using a deep neural network) vs. time for an example video stimulus (a few video frames and overlaid saliency heatmaps shown for reference) that the participants watch.
  • Figure 3: Correlation between saliency features and the participant recorded emotions (Valence and Arousal).
  • Figure 4: Example frames from a few visual stimuli having high valence and low arousal. The heatmap superimposed over the frame represents the salient regions identified by the deep learning network. As seen in (a) and (c), there can be multiple salient regions in a single frame.
  • Figure 5: Example frames from a few visual stimuli having low valence and high arousal. The heatmap over the frame highlights salient regions identified by the deep learning network.
  • ...and 4 more figures