Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment

Jiaying Hong, Ting Zhu, Thanet Markchom, Huizhi Liang

TL;DR

Art2Music is a lightweight, feeling-aligned cross-modal framework for generating music from artistic images and accompanying commentary. It introduces ArtiCaps, a pseudo-aligned tri-modal dataset built by semantically matching descriptions from ArtEmis and MusicCaps, and a two-stage pipeline: Mel-spectrogram generation from a gated fusion of image and text embeddings, followed by HiFi-GAN waveform reconstruction. The approach achieves strong perceptual and spectral fidelity while maintaining cross-modal feeling alignment, even with limited training data (50k samples for Mel-spectrogram generation). A small LLM-based case study corroborates cross-modal feeling consistency, supporting the method's practicality for interactive art installations and personalized soundscapes. The work emphasizes lightweight, scalable cross-modal generation without explicit emotion labels and outlines future improvements in data alignment and stylistic diversity.
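
The semantic matching behind ArtiCaps can be pictured as a nearest-neighbour search over description embeddings. The sketch below is illustrative only and is not the authors' pipeline: it assumes each description has already been embedded as a fixed-length vector, and the function name `match_captions`, the `min_sim` threshold, and the toy vectors are all hypothetical.

```python
import numpy as np

def match_captions(art_emb, music_emb, min_sim=0.0):
    """Pair each art-description embedding with its most similar music description.

    art_emb:   array of shape (n_art, d)
    music_emb: array of shape (n_music, d)
    Returns a list of (art_index, music_index, cosine_similarity) tuples.
    `min_sim` is a hypothetical cut-off for discarding weak matches.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    a = art_emb / np.linalg.norm(art_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    sim = a @ m.T                      # (n_art, n_music) similarity matrix
    best = sim.argmax(axis=1)          # best music match for every art item
    pairs = []
    for i, j in enumerate(best):
        score = float(sim[i, j])
        if score >= min_sim:
            pairs.append((i, int(j), score))
    return pairs

# Toy embeddings standing in for real text-encoder outputs (assumption).
art = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.2]])
music = np.array([[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
pairs = match_captions(art, music, min_sim=0.5)
```

Raising the threshold trades dataset size for tighter pseudo-alignment, which is one way to read the cosine-similarity distribution the paper reports for its matched pairs.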

Abstract

With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Fréchet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
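
To make the first stage more concrete, here is a minimal NumPy sketch of a gated residual fusion and a frequency-weighted L1 loss. The abstract does not give the exact formulation, so everything here is an assumption: the gate form (`img + gate * txt`), the linear weight ramp over Mel bins, the `w_max` parameter, and all function names are hypothetical illustrations rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_fusion(img, txt, W, b):
    """Fuse an image and a text embedding (both of shape (d,)).

    Assumed form: a sigmoid gate computed from the concatenated embeddings
    scales the text embedding, which is then added residually to the image
    embedding. W has shape (2d, d) and b has shape (d,).
    """
    gate = sigmoid(np.concatenate([img, txt]) @ W + b)  # values in (0, 1)
    return img + gate * txt

def freq_weighted_l1(pred, target, w_max=2.0):
    """L1 loss over Mel-spectrograms of shape (n_mels, n_frames).

    Assumed weighting: per-bin weights rise linearly from 1.0 at the lowest
    Mel bin to w_max at the highest, so high-frequency errors cost more.
    """
    n_mels = pred.shape[0]
    weights = np.linspace(1.0, w_max, n_mels)[:, None]  # broadcast over frames
    return float(np.mean(weights * np.abs(pred - target)))

# Toy usage with untrained (zero) gate parameters, so the gate is 0.5 everywhere.
d = 4
fused = gated_residual_fusion(np.ones(d), 2.0 * np.ones(d),
                              np.zeros((2 * d, d)), np.zeros(d))

target = np.zeros((4, 2))
low_err = target.copy();  low_err[0, :] = 1.0    # error only in the lowest bin
high_err = target.copy(); high_err[-1, :] = 1.0  # error only in the highest bin
loss_low = freq_weighted_l1(low_err, target)
loss_high = freq_weighted_l1(high_err, target)
```

With these toy inputs the same absolute error is penalised twice as heavily in the top Mel bin as in the bottom one, which is the intended effect of weighting by frequency.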

Paper Structure

This paper contains 26 sections, 2 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Distribution of cosine similarity scores between painting-side and audio-side emotional keywords.
  • Figure 2: Emotion polarity alignment heatmap showing the average similarity between sentiment polarities of painting and audio modalities. Diagonal entries (e.g., positive–positive, neutral–neutral) exhibit higher values, suggesting retained polarity consistency during weakly supervised semantic matching.
  • Figure 3: Overview of the proposed Art2Music framework, consisting of two stages: (1) multimodal feeling alignment and Mel-spectrogram generation from image and textual input; (2) waveform reconstruction using a HiFi-GAN vocoder.
  • Figure 4: Mel-spectrogram comparison between generated audio (left) and reference audio (right). The generated spectrogram preserves the overall spectral contour and harmonic structures, demonstrating strong consistency in time-frequency representation.
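
A spectrogram comparison like the one in Figure 4 can also be quantified. The sketch below computes two of the metrics named in the abstract, Log-Spectral Distance and cosine similarity, using a commonly used definition of LSD (per-frame RMS difference of dB spectra, averaged over frames). The paper's exact variant may differ; the function names, the `eps` floor, and the toy arrays are assumptions.

```python
import numpy as np

def log_spectral_distance(ref, gen, eps=1e-10):
    """LSD in dB between two power spectrograms of shape (n_bins, n_frames).

    eps is a small floor (assumption) to avoid taking the log of zero.
    """
    log_ref = 10.0 * np.log10(ref + eps)
    log_gen = 10.0 * np.log10(gen + eps)
    # RMS over frequency bins for each frame, then mean over frames.
    per_frame = np.sqrt(np.mean((log_ref - log_gen) ** 2, axis=0))
    return float(np.mean(per_frame))

def cosine_similarity(ref, gen):
    """Cosine similarity between two flattened spectrograms."""
    a, b = ref.ravel(), gen.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy spectrograms: the "generated" one is the reference scaled by 10 in power.
reference = np.ones((4, 3))
generated = 10.0 * reference
lsd = log_spectral_distance(reference, generated)
cos = cosine_similarity(reference, generated)
```

The toy case shows why the two metrics complement each other: a uniform gain change leaves cosine similarity at its maximum while LSD still registers the level difference.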