Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline

Xuecheng Wu, Heli Sun, Junxiao Xue, Jiayu Nie, Xiangyan Kong, Ruofan Zhai, Liang He

TL;DR

eMotions introduces the first large-scale short-form video emotion analysis dataset with 27,996 videos across three platforms and six Plutchik emotions, complemented by a rigorous multi-stage annotation workflow and two variant subsets. It proposes AV-CANet, an end-to-end audio-visual baseline built on a Video Swin Transformer backbone, equipped with Local-Global Fusion (LGF) and Emotion Polarity Enhanced CE Loss (EP-CE) to address cross-modal inconsistencies from local to global scales. Through extensive experiments on eMotions and four public datasets, AV-CANet demonstrates superior performance and provides actionable insights into modality contributions, fusion strategies, and loss design for SV-based VEA. The work offers a practical foundation for affective content analysis in SVs and paves the way for more robust, culturally aware, and scalable emotion understanding in real-world multimedia data.

Abstract

Nowadays, short-form videos (SVs) are essential to acquiring and sharing information on the web in daily life. The prevalent use of SVs to spread emotions makes video emotion analysis (VEA) of SVs necessary. Given the lack of emotion data for SVs, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. We alleviate the impact of subjectivity on labeling quality through better personnel allocation and multi-stage annotation. In addition, we provide category-balanced and test-oriented variants through targeted data sampling. Some commonly used video types, such as facial expression videos, have been well studied. However, analyzing emotions in SVs remains challenging, because the broader content diversity brings more distinct semantic gaps and makes emotion-related features harder to learn, and because emotion inconsistence under the prevalent audio-visual co-expression causes local biases and collective information gaps. To tackle these challenges, we present AV-CANet, an end-to-end audio-visual baseline that employs a video transformer to better learn semantically relevant representations. We further design the Local-Global Fusion Module to progressively capture the correlations between audio and visual features. The EP-CE Loss is then introduced to guide model optimization. Extensive experimental results on seven datasets demonstrate the effectiveness of AV-CANet and provide broad insights for future work. We also investigate the key components of AV-CANet through ablation studies. Datasets and code will be fully open soon.

Paper Structure (19 sections, 11 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: An overview of eMotions, composed of 27,996 videos of six emotions across Douyin, Kuaishou, and TikTok. The colors of the frame borders specify the emotional categories to which the videos belong: Excitation, Fear, Neutral, Relaxation, Sadness, Tension.
  • Figure 2: The overall illustration of emotion inconsistence. (a) The visual and auditory modalities separately evoke different emotions, leading to expression conflict. (b) & (c) A lack of emotional information from the visual or auditory modality results in emotion misalignment.
  • Figure 3: (a) The overall pipeline of dataset construction. (b) The detailed workflow of personnel assignment and adjustments.
  • Figure 4: The MOS scores of six VEA datasets and our eMotions. Ek6: Ekman6. Mv: Music_video. Ve8: VideoEmotion8.
  • Figure 5: (a) & (b) Word clouds of topics and content types in eMotions. Larger text size indicates a higher frequency of occurrence. (c) Duration distribution of short-form videos in our dataset.
  • ...and 2 more figures