VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang

TL;DR

VidEmo introduces a family of video emotion foundation models that unify attribute perception, expression analysis, and high-level emotion understanding through a two-stage training pipeline: curriculum emotion learning followed by affective-tree reinforcement learning. At inference, the framework reasons hierarchically, and it relies on Emo-CFG, a large emotion-centric dataset, to train and evaluate fine-grained emotional understanding. Empirical results show VidEmo achieves state-of-the-art performance across 15 face-perception tasks and strong gains on downstream emotion tasks, improving over both open- and closed-source VideoLLMs. The work provides a scalable foundation for interpretable, emotion-centric video analysis and offers Emo-CFG as a data resource for future research in affective reasoning.

Abstract

Understanding and predicting emotion from videos has garnered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are dynamic and cue-dependent, making it difficult to understand complex and evolving emotional states with a reasonable rationale. To tackle these challenges, we propose a novel affective cue-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

Paper Structure

This paper contains 31 sections, 2 equations, 26 figures, 12 tables.

Figures (26)

  • Figure 1: Selected examples of inputs and outputs obtained from VidEmo. Apart from providing toolkits for basic attribute perception and expression analysis (top), VidEmo extends the cognitive capacity and is able to generate fine-grained emotional captions with explainable rationale (bottom).
  • Figure 2: Results Overview. Our best model, VidEmo-T1, shows superior performance across 15 face perception tasks, surpassing an advanced milestone (Gemini 2.0, 5 Feb 2025) on 14 of the 15 tasks.
  • Figure 3: Pipeline of VidEmo. (a) Training: The model is trained using curriculum emotion learning, divided into three stages: attribute, expression, and emotion tuning. A reference model provides initial parameters, and a policy model is trained with reward feedback. (b) Reasoning: The policy model performs hierarchical reasoning by sampling from the best attributes, expressions, and emotions to generate the final emotional output.
  • Figure 4: Visualization of attribute perception, expression analysis, and emotion understanding.
  • Figure 5: Data Curation Pipeline of the Emo-CFG dataset. (a) Data sources, drawn from 17 datasets. (b) Data labeling steps. (c) Data verification loop.
  • ...and 21 more figures
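
The hierarchical reasoning described in the Figure 3(b) caption (sample candidates at the attribute, expression, and emotion levels, keeping the best at each level) can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: `best_of_n`, `hierarchical_reasoning`, `make_sampler`, the candidate pools, and the hand-set scores are all hypothetical stand-ins. In the paper the candidates would come from the policy model and the scores from its reward feedback, neither of which is specified here.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: a sampler proposes one candidate given the context
# so far; a scorer rates a candidate (a stand-in for a learned reward).
Sampler = Callable[[str, random.Random], str]
Scorer = Callable[[str], float]
Stage = Tuple[str, Sampler, Scorer]


def best_of_n(context: str, sampler: Sampler, scorer: Scorer,
              n: int, rng: random.Random) -> str:
    """Draw n candidates for one level and keep the highest-scoring one."""
    candidates = [sampler(context, rng) for _ in range(n)]
    return max(candidates, key=scorer)


def hierarchical_reasoning(video_id: str, stages: List[Stage],
                           n: int = 4, seed: int = 0) -> List[Tuple[str, str]]:
    """Walk the levels in order, conditioning each level on earlier choices."""
    rng = random.Random(seed)
    context = video_id
    trace: List[Tuple[str, str]] = []
    for name, sampler, scorer in stages:
        choice = best_of_n(context, sampler, scorer, n, rng)
        trace.append((name, choice))
        # Later levels see the choices made at earlier levels.
        context = f"{context}\n{name}: {choice}"
    return trace


def make_sampler(pool: List[str]) -> Sampler:
    """Toy sampler: picks uniformly from a fixed pool, ignoring the context."""
    return lambda context, rng: rng.choice(pool)


# Toy candidate pools and arbitrary hand-set scores (illustrative only).
TOY_POOLS: List[List[str]] = [
    ["adult, short hair", "adult, glasses"],
    ["raised brows", "tight lips", "slight smile"],
    ["surprise", "anger", "contentment"],
]
TOY_SCORES: Dict[str, float] = {
    "adult, short hair": 0.6, "adult, glasses": 0.4,
    "raised brows": 0.7, "tight lips": 0.2, "slight smile": 0.5,
    "surprise": 0.8, "anger": 0.1, "contentment": 0.3,
}
TOY_STAGES: List[Stage] = [
    (name, make_sampler(pool), TOY_SCORES.__getitem__)
    for name, pool in zip(["attribute", "expression", "emotion"], TOY_POOLS)
]

if __name__ == "__main__":
    for level, choice in hierarchical_reasoning("clip_001", TOY_STAGES):
        print(f"{level}: {choice}")
```

The loop is greedy: it commits to one choice per level before moving on. A tree search that keeps several candidates alive at each level would be a natural variant, but the caption does not say which of the two the paper uses.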