EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang

TL;DR

EmoVid introduces the first large-scale, multimodal emotion-labeled video dataset tailored for stylized and non-realistic content, spanning animation, movie clips, and animated stickers with eight discrete emotions. The dataset provides rich annotations (emotion labels, color attributes, and text captions) and combines human and model-based labeling to achieve scalable, high-quality annotation. A comprehensive benchmark for text-to-video and image-to-video generation demonstrates that fine-tuning a state-of-the-art model (Wan2.1) on EmoVid data significantly improves emotional expressiveness and alignment in generated videos and stickers. EmoVid thus advances affective video computing in creative domains and enables emotion-driven content creation, editing, and storytelling, with practical implications for animation, cinema, and social media.

Abstract

Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.
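As a rough illustration of the visual attributes mentioned above (brightness, colorfulness, hue), the sketch below computes one plausible version of each for a single RGB frame. The abstract does not say how EmoVid computes these attributes, so every choice here is an assumption: brightness as mean Rec. 601 luma, colorfulness as the Hasler-Süsstrunk opponent-color statistic, and hue as the hue of the frame's mean color. The function name `frame_color_attributes` is hypothetical.

```python
import colorsys

import numpy as np


def frame_color_attributes(frame):
    """Return brightness, colorfulness, and hue for one RGB frame.

    `frame` is an H x W x 3 uint8 array in RGB order. The definitions
    below are assumptions for illustration, not the paper's formulas.
    """
    rgb = frame.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Brightness: mean Rec. 601 luma, scaled to [0, 1].
    brightness = float((0.299 * r + 0.587 * g + 0.114 * b).mean() / 255.0)

    # Colorfulness: Hasler-Suesstrunk statistic on opponent channels.
    rg = r - g
    yb = 0.5 * (r + g) - b
    colorfulness = float(
        np.hypot(rg.std(), yb.std()) + 0.3 * np.hypot(rg.mean(), yb.mean())
    )

    # Hue: hue of the frame's mean color, in degrees [0, 360).
    h, _, _ = colorsys.rgb_to_hsv(
        r.mean() / 255.0, g.mean() / 255.0, b.mean() / 255.0
    )
    return {
        "brightness": brightness,
        "colorfulness": colorfulness,
        "hue": float(h * 360.0),
    }
```

A video-level value could be obtained by averaging brightness and colorfulness over sampled frames; hue is an angle, so it would need a circular mean rather than a plain average.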

Paper Structure

This paper contains 30 sections, 9 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overview of the EmoVid dataset. The dataset spans eight emotion categories—Contentment, Awe, Amusement, Excitement, Sadness, Disgust, Fear, and Anger—and three content domains: Animation, Movie, and Sticker. The dataset captures diverse emotional expressions in various visual styles and contexts, demonstrating both multimodal richness (with associated text and audio) and cross-domain generality.
  • Figure 2: Relationship between different emotions. We follow Warriner et al. (2013) to arrange emotion categories on the valence-arousal model.
  • Figure 3: Emotion distribution across three video categories. Notably, the imbalance of animation and movie videos reflects the real-world emotional landscape of these domains.
  • Figure 4: Video features and color-emotion correlations. (a) t-SNE visualization of video features. Animation and Movie clusters are separated, with Sticker samples overlapping both, reflecting their hybrid content characteristics. (b) Positive-to-total emotion ratio across bins of colorfulness and brightness, exhibiting a distinct upward trend. (c) Emotion transition matrix from consecutive movie clips. Diagonal dominance indicates strong emotional persistence.
  • Figure 5: Qualitative results. (a) Comparison between the original Wan2.1 I2V model and our fine-tuned one. The ✓ indicates better emotional alignment. (b) Emotion-conditioned animated sticker generation using the fine-tuned Wan2.1 I2V model.
  • ...and 7 more figures
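
The emotion transition matrix described in the Figure 4(c) caption can be sketched as follows. This is a minimal illustration, assuming each movie is given as an ordered list of per-clip emotion labels and that the matrix is row-normalized; the captions do not state the exact procedure used in the paper, and the names `EMOTIONS` and `transition_matrix` are hypothetical.

```python
import numpy as np

# The eight categories listed in the Figure 1 caption.
EMOTIONS = [
    "Contentment", "Awe", "Amusement", "Excitement",
    "Sadness", "Disgust", "Fear", "Anger",
]


def transition_matrix(label_sequences):
    """Row-normalized transition frequencies between consecutive clips.

    `label_sequences` is an iterable of lists, one list of emotion labels
    per movie, in clip order. Entry [i, j] is the fraction of clips with
    emotion i that are followed by a clip with emotion j. Rows for
    emotions that never occur as a predecessor are left as zeros.
    """
    index = {name: i for i, name in enumerate(EMOTIONS)}
    counts = np.zeros((len(EMOTIONS), len(EMOTIONS)))
    for seq in label_sequences:
        # Pair each clip with its successor inside the same movie only.
        for prev, nxt in zip(seq, seq[1:]):
            counts[index[prev], index[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(
        counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0
    )
```

Large diagonal entries in such a matrix correspond to the emotional persistence the caption describes: a clip tends to be followed by another clip with the same emotion.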