Affective Video Content Analysis: Decade Review and New Perspectives

Junxiao Xue, Jie Wang, Xuecheng Wu, Qian Zhang

TL;DR

Affective Video Content Analysis (AVCA) studies how videos evoke emotions, combining psychology-based emotion models with multimodal learning. The paper surveys a decade of progress, covering unimodal and multimodal AVCA methods, emotion representations, and benchmark datasets, along with fusion strategies and evaluation standards. It identifies three key challenges (feature extraction, expression subjectivity, and multimodal fusion) and outlines future directions in attention-based fusion, trustworthy results, and large-scale, diverse datasets. The work underscores AVCA’s potential to enable emotion-aware applications in human-computer interaction, content recommendation, and public opinion analysis, motivating further advances in robust, interpretable, and scalable systems.

Abstract

Video content is rich in semantics and has the ability to evoke various emotions in viewers. In recent years, with the rapid development of affective computing and the explosive growth of visual data, affective video content analysis (AVCA) as an essential branch of affective computing has become a widely researched topic. In this study, we comprehensively review the development of AVCA over the past decade, particularly focusing on the most advanced methods adopted to address the three major challenges of video feature extraction, expression subjectivity, and multimodal feature fusion. We first introduce the widely used emotion representation models in AVCA and describe commonly used datasets. We summarize and compare representative methods in the following aspects: (1) unimodal AVCA models, including facial expression recognition and posture emotion recognition; (2) multimodal AVCA models, including feature fusion, decision fusion, and attention-based multimodal models; (3) model performance evaluation standards. Finally, we discuss future challenges and promising research directions, such as emotion recognition and public opinion analysis, human-computer interaction, and emotional intelligence.

Paper Structure (24 sections, 5 equations, 18 figures, 4 tables)

This paper contains 24 sections, 5 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: The temporal nature of video plays an important role in AVCA. These images are captured from a continuous 5-second segment of the same video, showing Curry's emotions after winning the NBA championship once again. In images (a) and (d), Curry expresses different emotions (sadness vs. happiness).
  • Figure 2: Contextual information also proves helpful for the AVCA task. (a) The same image expresses different emotions without and with the detailed scene context (fear vs. surprise). (b) The voice can also influence the emotion perceived in the same video (sadness vs. positive).
  • Figure 3: Illustration of expression subjectivity. The same gestures or actions may elicit significantly different emotional expressions from content creators in different cultures. In most countries, people express agreement and approval with the gestures in images (a) and (b), whereas individuals in the Mediterranean region might use them to convey disdain and negation.
  • Figure 4: The credibility of model classification results plays a vital role in multimodal feature fusion. (a) Fusion without credibility produces an incorrect result for the image. (b) Incorporating confidence as fusion weights leads to a correct output.
  • Figure 5: Milestones in both general affective computing (above line, blue) and affective video content analysis (below line, red).
  • ...and 13 more figures
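
The idea in the Figure 4 caption, using confidence as fusion weights, can be illustrated with a minimal sketch of confidence-weighted decision fusion. This is not the method of any specific paper in the survey. The function name `fuse_decisions`, the two-class label set, and all probability and confidence values are hypothetical and chosen only for illustration. The sketch also assumes each modality's confidence score is supplied externally (for example, by some calibration step).

```python
from typing import List, Optional, Sequence


def fuse_decisions(
    probs: Sequence[Sequence[float]],
    confidences: Optional[Sequence[float]] = None,
) -> List[float]:
    """Fuse per-modality class probabilities by a weighted average.

    probs[m][k] is the probability modality m assigns to class k.
    confidences[m] is a non-negative weight for modality m; if omitted,
    all modalities are weighted equally (plain averaging).
    """
    if not probs:
        raise ValueError("at least one modality is required")
    n_classes = len(probs[0])
    if confidences is None:
        confidences = [1.0] * len(probs)
    if len(confidences) != len(probs):
        raise ValueError("need exactly one confidence per modality")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")

    fused = [0.0] * n_classes
    for p, c in zip(probs, confidences):
        if len(p) != n_classes:
            raise ValueError("all modalities must cover the same classes")
        for k, pk in enumerate(p):
            fused[k] += c * pk / total  # normalized confidence weight
    return fused


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical two-class example with made-up numbers.
LABELS = ["sadness", "happiness"]
visual = [0.2, 0.8]  # visual model favors "happiness"
audio = [0.7, 0.3]   # audio model favors "sadness"

plain = fuse_decisions([visual, audio])
weighted = fuse_decisions([visual, audio], confidences=[0.2, 0.9])

print(LABELS[argmax(plain)])     # equal weights follow the visual model
print(LABELS[argmax(weighted)])  # confidence weights follow the audio model
```

With equal weights the visual prediction dominates. Once the visual model is given a low confidence and the audio model a high one, the fused decision flips. This mirrors the contrast between panels (a) and (b) of Figure 4, assuming the low-confidence modality is the one that is wrong.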