EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

Shengqi Dang; Yi He; Long Ling; Ziqing Qian; Nanxuan Zhao; Nan Cao

TL;DR

EmotiCrafter tackles continuous emotional control in text-to-image generation by embedding continuous Valence-Arousal values into textual prompts and injecting them into a diffusion backbone. The approach introduces an Emotion-Embedding Network that maps $V$ and $A$ into prompt features and fuses them with Stable Diffusion XL via cross-attention, guided by a density-weighted loss to balance uneven V-A sampling. Empirical results show precise emotion-content alignment, superior continuity over baselines and GPT-4+SDXL, and meaningful user-study gains, advancing affective content creation. The work provides methodological innovations (V-A encoding, 12-block Emotion Injection Transformer, KDE-based loss) and a training dataset, with implications for fine-grained, emotionally aware image generation in practical applications.

Abstract

Recent research shows that emotions can enhance users' cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this work, we introduce the new task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values. Specifically, we propose a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling the capture of specific emotions in alignment with intended input prompts. Additionally, we introduce a loss function to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.
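The TL;DR describes a density-weighted, KDE-based loss that balances uneven V-A sampling, but this page does not reproduce its formula. The following is a minimal sketch of one plausible form, not the authors' implementation: each training sample is weighted by the inverse of a Gaussian kernel density estimate at its (V, A) label, so that rare emotions contribute more to the loss. The function names, the isotropic kernel, the bandwidth of 0.5, and the mean-squared-error base loss are all assumptions made for illustration.

```python
import math


def gaussian_kde_density(points, query, bandwidth=0.5):
    """Density at `query` estimated from 2-D `points` with an isotropic Gaussian kernel.

    `bandwidth` is an illustrative choice, not a value taken from the paper.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for v, a in points:
        sq_dist = (query[0] - v) ** 2 + (query[1] - a) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total


def density_weights(points, bandwidth=0.5):
    """Inverse-density weight per sample, rescaled so the weights average to 1."""
    densities = [gaussian_kde_density(points, p, bandwidth) for p in points]
    inverse = [1.0 / d for d in densities]  # each point counts itself, so d > 0
    mean_inverse = sum(inverse) / len(inverse)
    return [w / mean_inverse for w in inverse]


def weighted_mse(predictions, targets, weights):
    """Mean over samples of weight * per-sample mean squared error (assumed base loss)."""
    per_sample = [
        sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)
        for pred, tgt in zip(predictions, targets)
    ]
    return sum(w * l for w, l in zip(weights, per_sample)) / len(per_sample)


if __name__ == "__main__":
    # Three samples clustered near neutral emotion, one isolated sample.
    va_labels = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.5, -2.5)]
    print(density_weights(va_labels))
```

In this sketch the isolated sample at (2.5, -2.5) receives a larger weight than the three clustered near the origin, which is the rebalancing effect the TL;DR attributes to the loss.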

Paper Structure (16 sections, 9 equations, 9 figures, 4 tables)

Introduction
Related work
Visual Emotion Analysis
Image Emotion Transfer
Conditional Image Generation
Method
Overview
Emotion-Embedding Network
Dataset and Training
Evaluation
Generation Results
Comparisons
User Study
Ablation Study
Conclusion and Limitations
...and 1 more section

Figures (9)

Figure 1: Valence-Arousal model.
Figure 2: Overview of our method. (a) We collect an image dataset annotated with V-A values, neutral prompts, and emotional prompts. These prompts are encoded into features by the prompt encoder $\mathcal{E}$. (b) We then design (b.1) a transformer-based emotion-embedding network $\mathcal{M}$ that embeds V-A values into textual features, and (b.2) a specialized loss function that enhances the emotional resonance of generated images. The output of the emotion-embedding network serves as the condition for the image generation model $\mathcal{G}$ to generate emotional images.
Figure 3: Structure of Emotion Injection Block. It accepts hidden state $h_{i-1}$ as input and produces $h_{i}$ as output. The V-feature $e_v$ and A-feature $e_a$ represent the emotion features, which are injected through the cross-attention module.
Figure 4: Results under multiple inputs. (a) Overriding semantic content ('a child in the amusement park') with sad V-A (-2,-2); (b) Discrete emotion mapping in V-A space as emotion input; (c) Empty-prompt generation with pure emotion condition; (d) Fine-grained control of V-A variations with a granularity of 0.2.
Figure 5: Qualitative comparisons with baselines. The images are generated at V-A values of -1.5, 0, and 1.5. Only our approach and GPT-4+SDXL generate images that clearly reflect emotional variations. Notably, our results show better continuity, indicating superior controllability over continuous V-A values compared to GPT-4+SDXL.
...and 4 more figures
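The Figure 3 caption says the Emotion Injection Block takes $h_{i-1}$, injects the V-feature $e_v$ and A-feature $e_a$ through cross-attention, and outputs $h_{i}$. The sketch below illustrates only that injection step under simplifying assumptions: a single attention head, square projection matrices `w_q`, `w_k`, `w_v` (hypothetical names), and a plain residual connection. Any normalization, self-attention, or feed-forward layers in the actual block are omitted because the caption does not specify them.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_cross_attention(hidden, e_v, e_a, w_q, w_k, w_v):
    """Single-head cross-attention from prompt tokens to the two emotion features.

    hidden: list of token vectors (h_{i-1}); returns a list of the same shape (h_i).
    Queries come from the prompt tokens; keys and values come from e_v and e_a.
    """
    emotion_tokens = [e_v, e_a]
    keys = [matvec(w_k, e) for e in emotion_tokens]
    values = [matvec(w_v, e) for e in emotion_tokens]
    scale = math.sqrt(len(keys[0]))
    output = []
    for token in hidden:
        query = matvec(w_q, token)
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        attn = softmax(scores)
        context = [
            sum(a * val[j] for a, val in zip(attn, values))
            for j in range(len(token))
        ]
        # Residual connection: the emotion context is added to the token.
        output.append([t + c for t, c in zip(token, context)])
    return output
```

With a zero value projection the block reduces to the identity, which shows that in this sketch the emotion signal enters only through the cross-attention context added to each token.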
