Aesthetic Matters in Music Perception for Image Stylization: A Emotion-driven Music-to-Visual Manipulation

Junjie Xu, Xingjiao Wu, Tanren Yao, Zihao Zhang, Jiayang Bei, Wu Wen, Liang He

TL;DR

This work tackles translating musical emotion into visual content to enhance multimodal emotional understanding. It introduces EmoMV, a two-stage framework with Mus-Vis Textual Alignment (MVTA) and Emotion-aware Aesthetic Image Refinement (EAIR) to map music-derived emotion to visuals. A multi-scale evaluation, including image quality metrics and EEG-based emotional responses, demonstrates EmoMV's ability to produce emotionally coherent, aesthetically refined images across a 3.8k music-image dataset and various case studies. The findings suggest practical benefits for art therapy and interactive media, advancing cross-modal emotional integration in creative technologies.

Abstract

Emotional information is essential for enhancing human-computer interaction and deepening image understanding. However, while deep learning has advanced image recognition, the intuitive understanding and precise control of emotional expression in images remain challenging. Similarly, music research largely focuses on theoretical aspects, with limited exploration of its emotional dimensions and their integration with visual arts. To address these gaps, we introduce EmoMV, an emotion-driven music-to-visual manipulation method that manipulates images based on musical emotions. EmoMV combines bottom-up processing of music elements, such as pitch and rhythm, with top-down application of these emotions to visual aspects like color and lighting. We evaluate EmoMV using a multi-scale framework that includes image quality metrics, aesthetic assessments, and EEG measurements to capture real-time emotional responses. Our results demonstrate that EmoMV effectively translates music's emotional content into visually compelling images, advancing multimodal emotional integration and opening new avenues for creative industries and interactive technologies.

Paper Structure (13 sections, 3 equations, 4 figures, 4 tables)

This paper contains 13 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of our emotion-driven music-to-visual manipulation framework, termed EmoMV. EmoMV is composed of Mus-Vis Textual Alignment, Emotional Aesthetic Description via Lighting, and Harmonizing Visuals with Musical Aesthetics, which together extract emotional information and express it in the generated image.
  • Figure 2: Qualitative comparison of original images from HealSoul with the corresponding images generated by EmoMV. Given the paired healing music, EmoMV generates images with more light and better composition.
  • Figure 3: Ablation study of the effect of the prompt input to iclight. We show images from HealSoul alongside images generated by EmoMV branches without EAIR (w/o EAIR) or without MVTA (w/o MVTA).
  • Figure 4: Case Study of EmoMV for Emotion-Driven Music-to-Image Generation. Left: Visual outputs generated from the same image with four different musical inputs, illustrating how variations in musical emotion (e.g., tempo, rhythm, mood) influence the aesthetic styling and emotional tone of the generated images using the EmoMV method. Right: Visual outputs generated from the same musical input but with four different scene images, demonstrating how varying visual contexts affect the interpretation of the music’s emotional cues and alter the visual aesthetic representation in the EmoMV framework.