Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Ivan Rinaldi, Matteo Mendula, Nicola Fanelli, Florence Levé, Matteo Testi, Giovanna Castellano, Gennaro Vessio

TL;DR

The paper proposes ArtToMus, the first framework explicitly designed for direct artwork-to-music generation. It maps digitized artworks to music without image-to-text translation or language-based semantic supervision, and it achieves competitive perceptual quality and meaningful cross-modal correspondence.

Abstract

Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.
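
As a rough illustration of what "projecting visual embeddings into the conditioning space" could look like, the following Python sketch maps a single image embedding to a short sequence of conditioning vectors with one linear layer. It is not the authors' implementation: the dimensions, the function names (init_weights, project_image_embedding), and the use of a single linear map with random weights are all assumptions made for illustration. In the actual framework the mapping is learned, and its architecture is not specified in this summary.

```python
import random

# All sizes below are hypothetical and chosen only to keep the example small.
IMAGE_DIM = 8    # assumed size of the visual embedding
COND_DIM = 4     # assumed width of one conditioning vector
NUM_TOKENS = 3   # assumed length of the conditioning sequence


def init_weights(rows, cols, seed=0):
    """Random stand-in for weights that would be learned during training."""
    rng = random.Random(seed)
    scale = 1.0 / cols ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def project_image_embedding(embedding, weights, num_tokens, cond_dim):
    """Linearly map one image embedding to a num_tokens x cond_dim sequence."""
    if len(weights) != num_tokens * cond_dim:
        raise ValueError("weights must have num_tokens * cond_dim rows")
    # Matrix-vector product: one output value per weight row.
    flat = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    # Split the flat output into num_tokens vectors of width cond_dim.
    return [flat[i * cond_dim:(i + 1) * cond_dim] for i in range(num_tokens)]


weights = init_weights(NUM_TOKENS * COND_DIM, IMAGE_DIM)
image_embedding = [0.1 * i for i in range(IMAGE_DIM)]  # toy embedding
conditioning = project_image_embedding(
    image_embedding, weights, NUM_TOKENS, COND_DIM
)
print(len(conditioning), len(conditioning[0]))
```

The resulting sequence stands in for the conditioning input that a latent diffusion model would consume in place of text-derived embeddings.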

Paper Structure

This paper contains 24 sections, 14 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Unified captioning pipeline for images and audio. Images are captioned using an MLLM guided by a structured prompt; audio is captioned through a segment-level music captioner and then fused by an LLM. All captions are evaluated using the proposed $\mathcal{ICS}\textit{core}$ and $\mathcal{ACS}\textit{core}$ metrics; captions below threshold are regenerated with refined prompts.
  • Figure 2: Violin plots of similarity score distributions for image–audio pairing strategies. Scores, computed from combinations of image, audio, and caption embeddings using ImageBind, are grouped into low ($\le$0.25), medium (0.25–0.6), and high ($\ge$0.6) ranges, highlighting which model–modality pairs produce stronger or weaker alignments.
  • Figure 3: Overview of the $\mathcal{A}\textit{rt2}\mathcal{M}\textit{us}$ architecture. Inspired by the generative paradigm of AudioLDM 2, $\mathcal{A}\textit{rt2}\mathcal{M}\textit{us}$ reformulates the conditioning interface to enable direct artwork-to-music generation. A Visual Conditioning Extractor and an Image Aligner learn to project image embeddings into GPT-2's LoA embedding space, establishing cross-modal alignment without textual intermediates. The pretrained AudioLDM 2 backbone is used as a frozen generative prior. GPT-2 operates as a modality-agnostic LoA translator, integrating visual conditioning within the latent diffusion process. Music is encoded into mel-spectrogram latents by the Latent Extractor and synthesized into audio by the LoA-to-Audio Generator.
  • Figure 4: The Visual Conditioning Extractor and Image Aligner modules.
  • Figure 5: Distribution of artworks by music genre, segmented by artistic style. Abstract Art, Abstract Expressionism, Art Nouveau (Modern), and Expressionism appear prominently across most genres. Conversely, Early Renaissance and Symbolism are less frequent.
  • ...and 3 more figures
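
The grouping of similarity scores described in the Figure 2 caption can be sketched in a few lines of Python. The thresholds (0.25 and 0.6) come from the caption. Everything else is an assumption: the caption does not state which similarity measure is used, so cosine similarity is taken here as a plausible choice, and the toy vectors stand in for ImageBind embeddings, which are not computed in this sketch. The function names are hypothetical.

```python
import math

LOW_MAX = 0.25   # scores at or below this are "low" (Figure 2 caption)
HIGH_MIN = 0.6   # scores at or above this are "high" (Figure 2 caption)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_range(score):
    """Assign a score to the low, medium, or high range."""
    if score <= LOW_MAX:
        return "low"
    if score >= HIGH_MIN:
        return "high"
    return "medium"


# Toy vectors standing in for an image embedding and three audio embeddings.
image_vec = [1.0, 0.0]
audio_vecs = {"a": [0.0, 1.0], "b": [1.0, 2.0], "c": [1.0, 0.0]}

for name, vec in audio_vecs.items():
    score = cosine_similarity(image_vec, vec)
    print(name, round(score, 3), similarity_range(score))
```

Treating the boundaries as inclusive for the low and high ranges follows the caption's use of ≤ and ≥; the medium range then covers scores strictly between the two thresholds.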