MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization
Yingjie Xia, Xi Wang, Jinglei Shi, Vicky Kalogeiton, Jian Yang
TL;DR
This paper introduces MUSE, a diffusion-based, unified framework for emotional image generation and editing that trades dataset training for test-time optimization of emotion tokens guided by an off-the-shelf emotion classifier. It leverages CLIP-based semantic similarity to time the injection of emotional guidance and employs a multi-emotion loss to suppress inherent and similar emotions, improving emotional accuracy while preserving content fidelity. Across EmoSet, COCO, and FI_8, MUSE achieves superior emotion evocation, semantic alignment, and visual realism, outperforming state-of-the-art generation and editing methods without requiring diffusion-model updates or specialized emotional datasets. The approach establishes a new paradigm for emotion synthesis, enabling stable, flexible, and high-quality emotional control in images with practical implications for therapeutic art, storytelling, and emotion-aware communication.
Abstract
Images evoke emotions that profoundly influence perception and are often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS), which is widely used in both the LLM and diffusion model communities, MUSE avoids the need to update the diffusion model or to rely on specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis, by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance, by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion should guide synthesis, through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all compared methods for both generation and editing, improving emotional accuracy and semantic diversity while balancing desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.
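To make the HOW / WHEN / WHICH decomposition concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a frozen linear-softmax model standing in for the off-the-shelf emotion classifier, a plain list of per-step similarity scores standing in for CLIP-based semantic similarity, and one possible form of a multi-emotion loss (cross-entropy toward the target emotion plus a penalty on the probabilities of emotions to suppress). All names (`emotion_probs`, `loss_and_grad`, `optimize_token`, `tau`, `lam`) and all numeric values are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def emotion_probs(W, token):
    """Frozen toy 'emotion classifier' (assumption): linear logits + softmax."""
    logits = [sum(w * t for w, t in zip(row, token)) for row in W]
    return softmax(logits)

def loss_and_grad(W, token, target, suppress, lam):
    """Assumed multi-emotion loss: -log p[target] + lam * sum(p[k] for k in suppress).

    Returns the loss and its gradient with respect to the token only;
    the classifier weights W are never updated.
    """
    p = emotion_probs(W, token)
    loss = -math.log(p[target]) + lam * sum(p[k] for k in suppress)
    grad_z = []
    for j in range(len(p)):
        g = p[j] - (1.0 if j == target else 0.0)  # cross-entropy term
        # derivative of each suppressed probability p[k] w.r.t. logit j
        g += lam * sum(p[k] * ((1.0 if k == j else 0.0) - p[j]) for k in suppress)
        grad_z.append(g)
    grad_token = [sum(W[j][i] * grad_z[j] for j in range(len(W)))
                  for i in range(len(token))]
    return loss, grad_token

def optimize_token(W, token, target, suppress, similarities,
                   tau=0.25, lr=0.5, lam=0.5):
    """Test-time gradient descent on the emotion token.

    HOW:   gradients from the frozen classifier update only the token.
    WHEN:  a step is guided only if its similarity score reaches tau (assumed gate).
    WHICH: the loss raises the target emotion and suppresses the listed ones.
    """
    token = list(token)
    history = []
    for sim in similarities:
        if sim < tau:
            continue  # content not yet established; skip emotional guidance
        loss, grad = loss_and_grad(W, token, target, suppress, lam)
        history.append(loss)
        token = [t - lr * g for t, g in zip(token, grad)]
    return token, history

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # 3 toy emotions, 2-dim token
    sims = [0.10, 0.20, 0.30, 0.30, 0.30, 0.30]  # made-up similarity schedule
    tok, hist = optimize_token(W, [0.0, 0.0], target=0, suppress=[1],
                               similarities=sims)
    print("guided steps:", len(hist))
    print("target probability:", emotion_probs(W, tok)[0])
```

In this sketch only the token vector changes, mirroring the claim that no diffusion-model update is required. In a real pipeline the classifier would score a decoded image rather than the token itself, and the gradient would be backpropagated through the generation process.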
