EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation

Bingyuan Wang, Xingbei Chen, Zongyang Qiu, Linping Yuan, Zeyu Wang

TL;DR

EmoSpace tackles the lack of fine-grained emotion control in immersive VR content generation by learning a dynamic, vision-language grounded prototype space that encodes nuanced affect. It introduces a hierarchical representation with a large prototype bank and cross-modal fusion, paired with a diffusion-based generation pipeline that uses multi-prototype guidance, temporal blending, and attention reweighting for precise emotion conditioning. The framework extends to immersive VR tasks, including emotional panorama generation, outpainting, and stylized content, with iterative prompt refinement to align prompts with target emotion prototypes. Extensive quantitative, qualitative, and VR-user studies demonstrate superior emotion accuracy and aesthetics over baselines, and reveal VR enhances subjective emotional experience despite higher cognitive load. These results support the potential of integrating fine-grained affect modeling with immersive technologies for applications in therapy, education, storytelling, and cultural preservation.

Abstract

Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.
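The abstract names multi-prototype guidance and temporal blending but does not spell out their formulation here. The following is a minimal Python sketch of one plausible reading, not the authors' implementation. The function names (`multi_prototype_condition`, `temporal_blend`), the cosine-similarity top-k selection, the softmax temperature, and the linear blending schedule are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_prototype_condition(query, prototypes, top_k=2, temperature=0.1):
    """Blend the top_k prototypes most similar to `query` (hypothetical scheme).

    Returns the blended vector and a {prototype_index: weight} mapping.
    """
    sims = [cosine(query, p) for p in prototypes]
    top = sorted(range(len(prototypes)), key=lambda i: sims[i], reverse=True)[:top_k]
    weights = softmax([sims[i] for i in top], temperature)
    blended = [
        sum(w * prototypes[i][d] for w, i in zip(weights, top))
        for d in range(len(query))
    ]
    return blended, dict(zip(top, weights))

def temporal_blend(content_cond, emotion_cond, step, num_steps):
    """Assumed linear schedule: emotion weight rises from 0 (first step) to 1 (last)."""
    alpha = step / (num_steps - 1) if num_steps > 1 else 1.0
    return [(1 - alpha) * c + alpha * e for c, e in zip(content_cond, emotion_cond)]

# Toy usage: three 2-D "prototypes" and a query close to the first one.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
emotion_vec, weights = multi_prototype_condition([1.0, 0.1], prototypes)
midway = temporal_blend([0.0, 0.0], emotion_vec, step=2, num_steps=5)
```

In this sketch the direction of the schedule (content first, emotion later) is a design guess; a real pipeline could equally weight emotion more heavily in early denoising steps.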

Paper Structure (28 sections, 7 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Example Results Generated by EmoSpace. We demonstrate EmoSpace's capability for immersive affective content generation in (a) emotional panorama generation from the similar prompt "an emotional panorama" and different fine-grained emotional descriptions (styles for each row: Ghibli, 3D render, toy, ink painting, pixel art), (b) emotional image outpainting from an original image and different emotions in different directions, (c) stylized emotional panorama generation from the similar prompt "city skyline" and emotion "awe" in different styles (styles for each row: toy, ink painting, pixel art).
  • Figure 2: Overview of EmoSpace. Our framework consists of three main components: (a) emotion prototype learning that learns dynamic, interpretable emotion representations through vision-language alignment with rich learnable prototypes, (b) emotion-conditioned generation featuring multi-prototype guidance, iterative prompt refinement (Fig. 3), temporal blending, and attention reweighting for fine-grained emotional control, and (c) immersive application scenarios supporting panorama creation, emotional image outpainting, and stylized generation for VR environments.
  • Figure 3: Iterative Prompt Refinement in the Latent Space. We iteratively optimize prompts through GPT association: generating candidate prompts and emotional descriptions, evaluating their semantic alignment with the target embedding, and selecting the best candidates until convergence, to achieve precise emotion-prompt correspondence.
  • Figure 4: Example Results Generated from Prototypes. We demonstrate EmoSpace's capability of fine-grained emotion modeling and control from the similar prompt "an emotional face in Studio Ghibli style" (no iterative refinement) and four randomly selected prototypes. Red arrows indicate consistent emotional features of each prototype.
  • Figure 5: Comparative Study Results. Qualitative comparison between different methods for emotional image generation. Given detailed content prompts and fine-grained emotional descriptions, our method demonstrates superior emotion fidelity and visual quality across diverse inputs.
  • ...and 3 more figures
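
The iterative prompt refinement loop described in the Figure 3 caption (propose candidates, score their alignment with a target embedding, keep the best, stop at convergence) can be sketched roughly as follows. This is an illustrative outline only: `refine_prompt`, the `propose` and `embed` callbacks, the cosine score, and the `tol` stopping rule are hypothetical stand-ins for the GPT association step and the learned embedding space, neither of which is specified in this excerpt.

```python
import math
from typing import Callable, List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def refine_prompt(
    prompt: str,
    target: Sequence[float],
    propose: Callable[[str], List[str]],   # hypothetical: stands in for GPT association
    embed: Callable[[str], List[float]],   # hypothetical: stands in for the text encoder
    max_iters: int = 10,
    tol: float = 1e-3,
) -> Tuple[str, float]:
    """Greedy refinement: keep the candidate best aligned with `target`.

    Stops when no candidate improves the score by at least `tol`,
    when `propose` returns nothing, or after `max_iters` rounds.
    """
    best, best_score = prompt, cosine(embed(prompt), target)
    for _ in range(max_iters):
        candidates = propose(best)
        if not candidates:
            break
        score, candidate = max(
            ((cosine(embed(c), target), c) for c in candidates),
            key=lambda pair: pair[0],
        )
        if score - best_score < tol:
            break  # treated as convergence in this sketch
        best, best_score = candidate, score
    return best, best_score

# Toy usage with a keyword-count "embedding" and a rule-based proposer.
toy_embed = lambda text: [float(text.count("calm")), float(text.count("storm")), 1.0]
toy_propose = lambda p: [p + " calm", p + " storm"]
refined, alignment = refine_prompt("sea", [1.0, 0.0, 0.0], toy_propose, toy_embed, max_iters=3)
```

The greedy single-best selection keeps the sketch short; a beam over several candidates per round would be a natural variation.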