MPJudge: Towards Perceptual Assessment of Music-Induced Paintings

Shiqi Jiang, Tianyi Liang, Huayuan Ye, Changbo Wang, Chenhui Li

TL;DR

The paper tackles perceptual assessment of music-induced paintings by defining perceptual coherence and constructing the MPD dataset annotated by domain experts. It introduces MPJudge, a music-conditioned visual encoder using Modality-Adaptive Normalization and trained with Direct Preference Optimization to leverage ambiguous judgments. Extensive experiments and user studies show superior performance over state-of-the-art baselines and improved interpretability via modulation maps. The work provides a foundation for reliable cross-modal evaluation in music-informed visual art and suggests future extensions to broader visual forms and interactive systems.

Abstract

Music-induced painting is a unique artistic practice in which visual artworks are created under the influence of music. Evaluating whether a painting faithfully reflects the music that inspired it poses a challenging perceptual assessment task. Existing methods primarily rely on emotion recognition models to assess the similarity between music and painting, but such models introduce considerable noise and overlook broader perceptual cues beyond emotion. To address these limitations, we propose a novel framework for music-induced painting assessment that directly models perceptual coherence between music and visual art. We introduce MPD, the first large-scale dataset of music-painting pairs annotated by domain experts based on perceptual coherence. To better handle ambiguous cases, we further collect pairwise preference annotations. Building on this dataset, we present MPJudge, a model that integrates music features into a visual encoder via a modulation-based fusion mechanism. To effectively learn from ambiguous cases, we adopt Direct Preference Optimization for training. Extensive experiments demonstrate that our method outperforms existing approaches. Qualitative results further show that our model more accurately identifies music-relevant regions in paintings.
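
To make "modulation-based fusion" concrete, here is a minimal sketch of one common form of it: feature-wise scale-and-shift (FiLM-style) conditioning, where painting tokens are normalized and then rescaled and shifted by parameters predicted from the music feature. This is an illustration under assumptions, not the paper's Modality-Adaptive Normalization, whose exact formulation is not given in this summary. The function names, the linear projections, and the `(1 + gamma)` parameterization are all hypothetical.

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]

def linear(weights, bias, vec):
    """Apply y = W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def modulate(tokens, music_feat, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift normalized painting tokens with music-derived parameters.

    Assumption: gamma and beta are per-channel and shared across all tokens;
    the paper may instead predict them per token or per layer.
    """
    gamma = linear(w_gamma, b_gamma, music_feat)  # per-channel scale offset
    beta = linear(w_beta, b_beta, music_feat)     # per-channel shift
    out = []
    for token in tokens:
        normed = layer_norm(token)
        out.append([(1.0 + g) * x + b for x, g, b in zip(normed, gamma, beta)])
    return out
```

With all projection weights and biases set to zero, `modulate` reduces to plain layer normalization, so the music input only changes the painting features to the extent that the learned projections are non-zero.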

Paper Structure

This paper contains 22 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of music-induced painting assessment with ground truth (and predicted) scores at the bottom.
  • Figure 2:
  • Figure 3: Pipeline of our model. The mel spectrogram is processed by the music encoder to extract music features. The painting is passed through the painting encoder, where the extracted music features are incorporated via a fusion module. A regression head then predicts a perception score for each music-painting pair. We optimize the model using a regression loss based on the ground truth scores, and additionally apply a DPO loss to learn from pairwise preference annotations in ambiguous cases.
  • Figure 4: Statistical analysis of user study on music-painting matching.
  • Figure 5: Visualization of Modulation Intensity Maps (MIMs) across layers. We show MIM results from the first three and last three Transformer blocks in the painting encoder. Brighter regions indicate stronger modulation by the music input. The ground truth (and predicted) scores are: 0.2 (0.17), 0.9 (0.94), 0.1 (0.14).
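
The training objective described in the Figure 3 caption (a regression loss on ground-truth scores plus a DPO loss on pairwise preferences) can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the summary does not say which regression loss is used or how DPO is adapted to scalar scores. The sketch assumes mean squared error, treats the predicted score as the implicit reward, and measures it against a frozen reference model's score. The names `preference_loss`, `total_loss`, `lam`, and `beta` are hypothetical.

```python
import math

def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth scores (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_loss(s_win, s_lose, ref_win, ref_lose, beta=1.0):
    """DPO-style pairwise term on scalar scores (an assumed adaptation).

    s_win / s_lose: model scores for the preferred and rejected painting.
    ref_win / ref_lose: scores from a frozen reference model for the same pair.
    """
    margin = (s_win - ref_win) - (s_lose - ref_lose)
    return -log_sigmoid(beta * margin)

def total_loss(pred, target, pref_pairs, lam=1.0, beta=1.0):
    """Regression loss plus a weighted average of pairwise preference terms."""
    reg = mse_loss(pred, target)
    if not pref_pairs:
        return reg
    pref = sum(preference_loss(*pair, beta=beta) for pair in pref_pairs)
    return reg + lam * pref / len(pref_pairs)
```

As a usage example, the three score pairs quoted in the Figure 5 caption can be passed as `mse_loss([0.17, 0.94, 0.14], [0.2, 0.9, 0.1])` to measure how far those predictions are from the ground truth. The preference term equals log 2 when the model and the reference agree on both paintings, and falls below that as the model raises the preferred painting's score relative to the rejected one.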