Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline
Xuecheng Wu, Heli Sun, Junxiao Xue, Jiayu Nie, Xiangyan Kong, Ruofan Zhai, Liang He
TL;DR
eMotions introduces the first large-scale short-form video emotion analysis dataset with 27,996 videos across three platforms and six Plutchik emotions, complemented by a rigorous multi-stage annotation workflow and two variant subsets. It proposes AV-CANet, an end-to-end audio-visual baseline built on a Video Swin Transformer backbone, equipped with Local-Global Fusion (LGF) and Emotion Polarity Enhanced CE Loss (EP-CE) to address cross-modal inconsistencies from local to global scales. Through extensive experiments on eMotions and four public datasets, AV-CANet demonstrates superior performance and provides actionable insights into modality contributions, fusion strategies, and loss design for SV-based VEA. The work offers a practical foundation for affective content analysis in SVs and paves the way for more robust, culturally aware, and scalable emotion understanding in real-world multimedia data.
Abstract
Short-form videos (SVs) have become essential to acquiring and sharing information on the web in daily life. Because SVs are widely used to spread emotions, video emotion analysis (VEA) for SVs is needed. To address the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. We reduce the impact of annotator subjectivity on labeling quality through careful personnel allocation and multi-stage annotation. In addition, we provide category-balanced and test-oriented variants through targeted data sampling. Some common video types, such as facial-expression videos, have been well studied; however, analyzing the emotions in SVs remains challenging, for two reasons. First, the broader content diversity introduces more pronounced semantic gaps and makes emotion-related features harder to learn. Second, emotions in SVs are predominantly co-expressed through audio and video, and inconsistency between the two modalities causes local biases and collective information gaps. To tackle these challenges, we present AV-CANet, an end-to-end audio-visual baseline that employs a video transformer to better learn semantically relevant representations. We further design the Local-Global Fusion Module to progressively capture the correlations between audio and visual features. The EP-CE Loss is then introduced to guide model optimization. Extensive experimental results on seven datasets demonstrate the effectiveness of AV-CANet and provide broad insights for future work. We also investigate the key components of AV-CANet through ablation studies. Datasets and code will be fully open soon.
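The abstract does not give the formulation of the EP-CE Loss, so the following is only an illustrative sketch of how an emotion-polarity term could be combined with standard cross-entropy; it should not be read as the authors' definition. The class-to-polarity mapping, the weight `lam`, and the function names are all hypothetical choices made for this example.

```python
import math

# Hypothetical mapping from 6 emotion class indices to 3 polarity groups
# (0 = positive, 1 = negative, 2 = neutral). Not taken from the paper.
POLARITY_OF_CLASS = [0, 0, 1, 1, 1, 2]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def polarity_aware_ce(logits, target, lam=0.5):
    """Class-level cross-entropy plus a weighted polarity-level term.

    Assumed form (illustrative only): CE(class) + lam * CE(polarity),
    where the polarity probability is the total probability mass of all
    classes that share the target's polarity group.
    """
    probs = softmax(logits)
    ce = -math.log(probs[target])
    target_pol = POLARITY_OF_CLASS[target]
    pol_mass = sum(
        p for p, group in zip(probs, POLARITY_OF_CLASS) if group == target_pol
    )
    pol_ce = -math.log(pol_mass)
    return ce + lam * pol_ce
```

Under this assumed form, a wrong prediction that stays within the correct polarity group is penalized less than an equally confident wrong prediction in a different polarity group, which is one plausible way for a loss to use polarity information.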
