EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

Yijie Guo, Dexiang Hong, Weidong Chen, Zihan She, Cheng Ye, Xiaojun Chang, Zhendong Mao

TL;DR

EmoVerse tackles the interpretability gap in Visual Emotion Analysis by introducing a large-scale, open-source dataset annotated with Background-Attribute-Subject triplets and grounded at the object level, alongside dual CES and DES emotion representations. A novel multi-stage Annotation and Verification Pipeline leverages advanced VLMs and a Chain-of-Thought critic to ensure high-quality labels with minimal human effort, while an interpretable DES projector maps visual cues into a $1024$-dimensional affective space and provides explanations for emotion attribution. The dataset is complemented by a two-stage fine-tuned model that yields accurate emotion grounding and interpretable DES embeddings, supported by comprehensive analyses showing strong cross-dataset transferability, robustness of the verification pipeline, and improved interpretability. Together, EmoVerse enables more explainable emotion-aware vision tasks and sets a foundation for future multi-emotion, multimodal, and controllable generation research.

Abstract

Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.
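
To make the annotation layers described above concrete, here is a minimal sketch of how one annotated image might be represented in code. It is an illustration only: the class names, field names, bounding-box format, and the choice to leave the attribute optionally grounded are all assumptions, not the released EmoVerse schema. The only figure taken from the text is the $1024$-dimensional DES space mentioned in the TL;DR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical schema: every name below is an illustrative assumption,
# not the actual EmoVerse annotation format.
BBox = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)

DES_DIM = 1024  # DES dimensionality as stated in the TL;DR


@dataclass
class GroundedElement:
    """A word or phrase, optionally tied to an image region."""
    phrase: str
    box: Optional[BBox] = None


@dataclass
class BASTriplet:
    """One Background-Attribute-Subject (B-A-S) triplet."""
    background: GroundedElement
    attribute: GroundedElement
    subject: GroundedElement


@dataclass
class EmotionAnnotation:
    """Dual CES/DES labels plus the B-A-S triplets for one image."""
    image_id: str
    ces_label: str            # discrete category (CES)
    des_vector: List[float]   # continuous representation (DES)
    triplets: List[BASTriplet] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject vectors that do not match the stated DES dimensionality.
        if len(self.des_vector) != DES_DIM:
            raise ValueError(f"expected {DES_DIM} DES values, got {len(self.des_vector)}")

    def subjects(self) -> List[str]:
        """Subject phrases, useful for subject-level attribution."""
        return [t.subject.phrase for t in self.triplets]


# Made-up example record; the label and phrases are placeholders.
example = EmotionAnnotation(
    image_id="img_000001",
    ces_label="contentment",
    des_vector=[0.0] * DES_DIM,
    triplets=[
        BASTriplet(
            background=GroundedElement("park", (0.0, 0.0, 640.0, 480.0)),
            attribute=GroundedElement("smiling"),
            subject=GroundedElement("child", (200.0, 120.0, 380.0, 460.0)),
        )
    ],
)
```

Keeping the triplets as a list allows several subjects per image, which matches the subject-level reasoning the abstract describes; whether the real annotations are stored this way is not stated in the source.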

Paper Structure

This paper contains 22 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The EmoVerse dataset is the first large-scale visual emotion dataset that combines Categorical Emotion States (CES) and Dimensional Emotion Space (DES) annotations, offering subject-level and word-level emotion attribution across diverse images.
  • Figure 2: Overview of EmoVerse. Images are collected from multiple sources and passed through the Annotation and Verification Pipeline. DES annotations are generated by our Interpretable Model, enabling a unified understanding of visual emotions.
  • Figure 3: Architecture of our Interpretable Model. The model fine-tunes a Qwen model to produce explanations and incorporates a Feature Extractor and an Attention Block to produce the DES representation.
  • Figure 4: Knowledge graph based on B-A-S triplets. The decoupled representation facilitates emotion attribution and provides extensibility for understanding and generating diverse affective scenarios.
  • Figure 5: Emotion category distribution statistics. Colored segments show the percentage of each category. $\Delta$ is the difference between the maximum and minimum and $\sigma$ is the variance. The EmoVerse dataset shows a well-balanced emotion distribution.
  • ...and 2 more figures
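
The balance statistics named in the Figure 5 caption can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes that $\Delta$ and $\sigma$ are computed over per-category percentages and that $\sigma$ is the population variance, neither of which is confirmed by the caption. The function name `balance_stats` and the toy labels are hypothetical.

```python
from collections import Counter
from statistics import pvariance
from typing import Dict, Iterable, Tuple


def balance_stats(labels: Iterable[str]) -> Tuple[Dict[str, float], float, float]:
    """Return per-category percentages, their max-min gap, and their variance.

    Assumptions (not confirmed by the source): statistics are taken over
    percentages, and the variance is the population variance. Categories
    that never occur in `labels` are not counted.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    pct = {cat: 100.0 * n / total for cat, n in counts.items()}
    delta = max(pct.values()) - min(pct.values())
    sigma = pvariance(list(pct.values()))
    return pct, delta, sigma


# Toy labels, not EmoVerse data: three of one category, one of another.
pct, delta, sigma = balance_stats(["joy", "joy", "joy", "fear"])
```

Under these assumptions a perfectly balanced dataset gives $\Delta = 0$ and $\sigma = 0$, so smaller values indicate a more even category distribution.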