MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

Yingjie Xia, Xi Wang, Jinglei Shi, Vicky Kalogeiton, Jian Yang

TL;DR

This paper introduces MUSE, a diffusion-based, unified framework for emotional image generation and editing that trades dataset training for test-time optimization of emotion tokens guided by an off-the-shelf emotion classifier. It leverages CLIP-based semantic similarity to time the injection of emotional guidance and employs a multi-emotion loss to suppress inherent and similar emotions, improving emotional accuracy while preserving content fidelity. Across EmoSet, COCO, and FI_8, MUSE achieves superior emotion evocation, semantic alignment, and visual realism, outperforming state-of-the-art generation and editing methods without requiring diffusion-model updates or specialized emotional datasets. The approach establishes a new paradigm for emotion synthesis, enabling stable, flexible, and high-quality emotional control in images with practical implications for therapeutic art, storytelling, and emotion-aware communication.

Abstract

Images evoke emotions that profoundly influence perception, and these emotions often take priority over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS), which is widely used in both the LLM and diffusion model communities, it avoids the need to update the diffusion model or to collect specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis toward, through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all compared methods for both generation and editing, improving emotional accuracy and semantic diversity while balancing the desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.
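
The HOW question above (gradient-based optimization of an emotion token against a frozen classifier) can be illustrated with a toy sketch. It is not the authors' implementation. The linear softmax classifier `W`, the 16-dimensional token, the step size, and the function name `optimize_emotion_token` are all assumptions made for illustration, and the diffusion model is omitted entirely. The sketch only shows the core idea that the classifier stays fixed while the token alone is updated.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

def optimize_emotion_token(token, W, target, steps=100, lr=1.0):
    """Gradient descent on the token only; the classifier weights W stay frozen.

    Hypothetical stand-in: the classifier is a linear softmax model, and the
    loss is the cross-entropy -log p(target | token).
    """
    token = token.copy()
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    history = []  # target-class probability before each update
    for _ in range(steps):
        probs = softmax(W @ token)
        history.append(probs[target])
        grad = W.T @ (probs - onehot)  # gradient of -log p_target w.r.t. token
        token -= lr * grad
    history.append(softmax(W @ token)[target])
    return token, history

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(8, 16))  # 8 emotion classes (assumed), frozen
token0 = np.zeros(16)               # neutral starting token (assumed)
token, history = optimize_emotion_token(token0, W, target=3)
print(f"target probability: {history[0]:.3f} -> {history[-1]:.3f}")
```

In the framework described above, the classifier would instead score an image predicted by the diffusion model, so the gradient would flow through that prediction back to the token; the toy model skips this step.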

Paper Structure

This paper contains 15 sections, 13 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Our unified framework, MUSE, addresses the how, when, and which questions in emotional image synthesis. It starts from a noise latent $z_T$ that is either sampled from a Gaussian distribution (for generation) or obtained via inversion from an existing image (for editing). A CLIP model is employed to compute the semantic similarity $s^{clip}_t$ between the synthesized image and the text prompt, which determines when to introduce the emotional guidance. To decide how to manipulate the emotion, we introduce a learnable emotion token alongside the textual prompt tokens during the inference stage. These tokens are optimized in an inner loop using an off-the-shelf emotional classifier that operates on the predicted denoised latent $\hat{z}_0$. Furthermore, we propose an emotional loss $\mathcal{L}_{\text{emo}}$ to determine which emotions to enhance; it comprises the term $\mathcal{L}_{\text{target}}$ for synthesizing the target emotion, and the terms $\mathcal{L}_{\text{inh}}$ and $\mathcal{L}_{\text{sim}}$ for suppressing inherent and similar emotions, respectively.
  • Figure 2: Visualization of suppressing similar and inherent emotions. (a) and (b) show the generation of a "park" image w.r.t the target emotion "anger", with and without inherent emotion suppression. (c) and (d) show the generation of a "dog staring at you" image w.r.t the target emotion "sadness", with and without similar emotion suppression. The x-axis denotes the number of inner emotional optimization loops, and the y-axis represents the emotion probability predicted by the classifier.
  • Figure 3: Qualitative comparison of emotional image generation and editing methods. Upper Part: Emotion generation results from SD rombach2022high, UG bansal2023universal, PixArt-$\alpha$ chen2024pixart, SDXL rombach2022high, SD3 esser2024scaling, FLUX flux2024, and EmoGen yang2024emogen. UG and MUSE are conditioned on both text prompts and emotion labels, while other methods rely solely on emotionally descriptive prompts. Lower Part: Emotion editing results from SDE mengsdedit, AIF weng2023affective, IDE zou2024towards, Forgedit zhang2023forgedit, and EmoEdit yang2025emoedit. All editing methods take a neutral input image and modify it according to the target emotion.
  • Figure 3: Ablation study. The 1st row shows the performance of MUSE using the classifier from EmoGen yang2024emogen, the 2nd-4th rows show the results of MUSE using different loss term combinations, and the 5th row shows the results of MUSE without the emotional token optimization. The final row shows the results of our proposed MUSE setting.
  • Figure 4: t-SNE van2008visualizing visualization of feature distributions for non-textual emotional generation across four models. Each data point represents the CLIP semantic features of an image, with different colors indicating the corresponding emotion categories. Var is the averaged intra-class variance of the CLIP features. Compared to the other three methods (a), (b), and (c), MUSE achieves better separation of content within the same emotion category while exhibiting a more dispersed overall semantic distribution.
  • ...and 1 more figures
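
The emotional loss named in the Figure 1 caption, with one term that promotes the target emotion and two that suppress inherent and similar emotions, might be sketched as follows. The exact form of each term is not given in this summary, so everything here is an assumption: the cross-entropy target term, the $-\log(1 - p)$ suppression terms, the weights `w_inh` and `w_sim`, and the function name `emotion_loss` are hypothetical illustrations rather than the paper's definitions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def emotion_loss(logits, target, inherent=(), similar=(), w_inh=1.0, w_sim=1.0):
    """Assumed form: L_emo = L_target + w_inh * L_inh + w_sim * L_sim.

    logits   : classifier outputs, one per emotion class
    target   : index of the emotion to evoke
    inherent : indices of emotions the content carries by default (to suppress)
    similar  : indices of emotions easily confused with the target (to suppress)
    """
    probs = softmax(np.asarray(logits, dtype=float))
    eps = 1e-12  # guards against log(0)
    l_target = -np.log(probs[target] + eps)                       # raise target
    l_inh = sum(-np.log(1.0 - probs[i] + eps) for i in inherent)  # lower inherent
    l_sim = sum(-np.log(1.0 - probs[j] + eps) for j in similar)   # lower similar
    return l_target + w_inh * l_inh + w_sim * l_sim

# Hypothetical 4-class example: class 0 is the target, class 1 is treated as
# the inherent emotion, and class 2 as an emotion similar to the target.
logits = [2.0, 1.0, 0.5, 0.1]
print(emotion_loss(logits, target=0))
print(emotion_loss(logits, target=0, inherent=[1], similar=[2]))
```

Each suppression term is positive whenever the corresponding emotion has non-zero probability, so the combined loss penalizes predictions that leave probability mass on the inherent or similar classes even when the target class already dominates.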