MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions

Suhwan Choi, Kyu Won Kim, Myungjoo Kang

TL;DR

MMVA is a tri-modal, emotion-aware encoder framework for images, music, and musical captions, trained on the IMEMNet-C dataset. It uses continuous valence-arousal (VA) matching scores, so image-music pairs can be sampled at random for training without strict one-to-one correspondences. It introduces asymmetric multimodal training and random VA-based matching, and is trained with VA-prediction and similarity losses that align the modalities through low-dimensional emotional representations. The approach achieves state-of-the-art VA prediction and shows zero-shot capability on tasks such as text-to-music retrieval, prompt construction for image-to-music generation, and video summarization, while random sampling lets the training data be used at scale. The work also contributes IMEMNet-C, which adds musical captions and enables new cross-modal analyses and applications in music-aware multimedia understanding.

Abstract

We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
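
The idea of scoring randomly sampled image-music pairs from their valence-arousal values can be sketched as follows. This is a minimal illustration, not the paper's formulation: the abstract does not give the matching-score equation, so the score here (one minus a normalized Euclidean distance in VA space) is an assumption, as are the [0, 1] value range, the function names `va_matching_score` and `sample_random_pairs`, and the toy annotations.

```python
import math
import random

# Assumption: valence and arousal are each scaled to [0, 1], so the largest
# possible distance between two VA points is sqrt(2).
MAX_VA_DISTANCE = math.sqrt(2.0)


def va_matching_score(va_a, va_b):
    """Hypothetical continuous score: 1.0 for identical VA values,
    0.0 for opposite corners of the unit VA square."""
    distance = math.dist(va_a, va_b)
    return 1.0 - distance / MAX_VA_DISTANCE


def sample_random_pairs(image_va, music_va, batch_size, rng):
    """Draw random image-music index pairs and attach a VA-based target
    score to each, so no fixed one-to-one pairing is required."""
    pairs = []
    for _ in range(batch_size):
        i = rng.randrange(len(image_va))
        j = rng.randrange(len(music_va))
        pairs.append((i, j, va_matching_score(image_va[i], music_va[j])))
    return pairs


# Toy annotations, made up for illustration: (valence, arousal).
image_va = [(0.9, 0.8), (0.2, 0.3), (0.5, 0.5)]
music_va = [(0.85, 0.75), (0.1, 0.9)]

for i, j, score in sample_random_pairs(image_va, music_va, 4, random.Random(0)):
    print(f"image {i} <-> music {j}: target score {score:.3f}")
```

Because any image can be paired with any music clip and still receive a meaningful target, the number of usable training pairs grows with the product of the two collection sizes rather than with the number of hand-matched pairs.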

Paper Structure

This paper contains 39 sections, 5 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison of multimodal matching.
  • Figure 2: Overview of MMVA.
  • Figure 3: Overview of text-to-music cross-modal retrieval.
  • Figure 4: Overview of Image-to-Music generation.
  • Figure 5: Musical prompt templates.
  • ...and 1 more figure