Towards Deeper Emotional Reflection: Crafting Affective Image Filters with Generative Priors

Peixuan Zhang, Shuchen Weng, Jiajun Tang, Si Li, Boxin Shi

TL;DR

The paper introduces the Affective Image Filter (AIF) task, which aims to translate visually-abstract emotions described in text into emotionally faithful, content-preserving images. It presents the AIF dataset and two models: AIF-B, a multi-modal transformer baseline, and AIF-D, a diffusion-prior architecture with a content-preservation module, LLM-based emotional reasoning, a voting ensemble, and redesigned aesthetics. Across quantitative metrics and user studies, AIF-D achieves superior content fidelity and emotional alignment compared with state-of-the-art baselines, demonstrating robust performance and practical potential for retouching and social sharing. The work highlights the value of combining explicit emotional cues with generative priors and advanced prompting to realize deeper emotional reflection in visual media.

Abstract

Social media platforms enable users to express emotions by posting text with accompanying images. In this paper, we propose the Affective Image Filter (AIF) task, which aims to reflect visually-abstract emotions from text into visually-concrete images, thereby creating emotionally compelling results. We first introduce the AIF dataset and the formulation of the AIF models. Then, we present AIF-B as an initial attempt based on a multi-modal transformer architecture. After that, we propose AIF-D as an extension of AIF-B towards deeper emotional reflection, effectively leveraging generative priors from pre-trained large-scale diffusion models. Quantitative and qualitative experiments demonstrate that AIF models achieve superior performance for both content consistency and emotional fidelity compared to state-of-the-art methods. Extensive user study experiments demonstrate that AIF models are significantly more effective at evoking specific emotions. Based on the presented results, we comprehensively discuss the value and potential of AIF models.

Paper Structure

This paper contains 33 sections, 28 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Towards deeper emotional reflection, we propose AIF-D to better meet the requirements of emotional fidelity and content consistency. Compared to the previous AIF-B, AIF-D overcomes three major challenges: (a) Blurry details. AIF-D preserves high-frequency details of the content image more effectively. (b) Keyword blindness. AIF-D has a deeper emotional understanding of text descriptions that lack specific keywords. (c) Overly-stylized effects. AIF-D reflects emotions with more appropriate artistic representation.
  • Figure 2: The pipeline of AIF-B. (a) Users input content images to provide visual content and text descriptions to evoke specific emotions. (b) Content images and text descriptions are encoded by the image encoder and text encoder, respectively. The encoded image and text tokens are fed into multi-modal transformer blocks, each comprising a Multi-headed Self-Attention (MSA) layer, a Multi-Layer Perceptron (MLP) layer, and residual connections. (c) To understand inherent emotional properties, AIF-B leverages the VAD dictionary as emotional prior knowledge. The emotional distribution loss is introduced to capture high-dimensional emotional cues. (d) The sentiment metric loss and anchor-based sentiment loss are proposed to effectively reflect specific emotions. The aesthetic loss is further adopted to ensure the aesthetic quality and enrich the artistic style of synthesized images.
  • Figure 3: The pipeline of AIF-D. (a) Users input content images to provide visual content and text descriptions to evoke specific emotions. (b) Content images and text descriptions are fed into the image encoder and text encoder to extract image tokens and text tokens, respectively. The noise prediction network estimates the noise at each diffusion step. Within the downsampling modules, each Vanilla Convolution (VC) block before the Cross-Attention (CA) blocks is equipped with a Content Preservation (CP) module. These CP modules integrate the context of content images to preserve high-frequency details. (c) An LLM with chain-of-thought (CoT) prompting handles complex emotional expressions through in-depth emotional reasoning. A voting ensemble mechanism evaluates emotional distributions from different points of view, combining low-level features (colors and texture) and mid-level features (image style and patch features) to enable accurate analysis of evoked emotions. (d) To enable accurate visualization of emotions, we propose an emotional reflection strategy that enhances the emotional understanding of the image decoder, using the sentiment metric loss and anchor-based sentiment loss. Finally, images are synthesized using a redesigned aesthetic loss, achieving a balance between artistic style and content consistency.
  • Figure 4: Visualization of applying the texture mapping loss across different blocks of the image decoder. (a) User-provided content images. (b) Text descriptions that reflect thoughts and feelings. (c)-(f) Results from applying the texture mapping loss at progressively later decoder blocks. (g) Results of AIF-D with appropriate artistic style and content consistency by combining losses across blocks.
  • Figure 5: Qualitative comparison with state-of-the-art methods. (a) User-provided content images. (b) Text descriptions that reflect thoughts and feelings. (c) ManiGAN. (d) DiffusionCLIP. (e) CLIPstyler. (f) CLVA. (g) AIF-B. (h) AIF-D.
  • ...and 4 more figures
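
The Figure 3 caption mentions a voting ensemble that evaluates emotional distributions from several points of view. The captions do not say how the votes are combined, so the snippet below is only a minimal sketch of plain soft voting (averaging per-voter probability distributions), not the authors' mechanism. The function name `soft_vote`, the three emotion labels, and the four example voters are all hypothetical.

```python
from typing import Dict, List


def soft_vote(distributions: List[Dict[str, float]]) -> Dict[str, float]:
    """Average several emotion distributions into one (plain soft voting).

    Assumption: every voter scores the same set of emotion labels.
    This is a generic ensemble rule, not the rule used by AIF-D.
    """
    if not distributions:
        raise ValueError("need at least one distribution")
    labels = sorted(distributions[0])
    for dist in distributions:
        if sorted(dist) != labels:
            raise ValueError("all voters must score the same emotion labels")

    # Unweighted mean of each label's score across voters.
    averaged = {
        label: sum(dist[label] for dist in distributions) / len(distributions)
        for label in labels
    }

    # Renormalise so the result sums to 1 even if the inputs did not.
    total = sum(averaged.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return {label: value / total for label, value in averaged.items()}


# Hypothetical voters, one per feature type named in the caption.
color_vote = {"joy": 0.6, "sadness": 0.1, "fear": 0.3}
texture_vote = {"joy": 0.5, "sadness": 0.3, "fear": 0.2}
style_vote = {"joy": 0.2, "sadness": 0.6, "fear": 0.2}
patch_vote = {"joy": 0.7, "sadness": 0.2, "fear": 0.1}

ensemble = soft_vote([color_vote, texture_vote, style_vote, patch_vote])
top_emotion = max(ensemble, key=ensemble.get)
```

Averaging keeps a single dissenting voter (here the style-based one) from overriding the others while still lowering confidence in the winning label. A weighted average or majority vote would be an equally plausible reading of the caption.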