A Survey on Cross-Modal Interaction Between Music and Multimodal Data

Sifei Li, Mining Tan, Feier Shen, Minyan Luo, Zijiao Yin, Fan Tang, Weiming Dong, Changsheng Xu

TL;DR

The paper surveys cross-modal interaction between music and multimodal data, organizing the landscape into music-driven, music-oriented, and bidirectional interactions. It provides a taxonomy of representations (audio and symbolic) and catalogs datasets and evaluation metrics, outlining tasks from music captioning and lyrics transcription to text-, image-, and video-conditioned music generation. Key contributions include a structured synthesis of methods, datasets, and benchmarks, plus critical discussion of data scarcity, modality fusion, and evaluation challenges. The work highlights practical implications for music creation, retrieval, and understanding, guiding future research toward richer multimodal music systems and scalable datasets.

Abstract

Multimodal learning has driven innovation across various industries, particularly in the field of music. By enabling more intuitive interaction experiences and enhancing immersion, it not only lowers the barriers to entry for music but also increases its overall appeal. This survey aims to provide a comprehensive review of multimodal tasks related to music, outlining how music contributes to multimodal learning and offering insights for researchers seeking to expand the boundaries of computational music. Unlike text and images, which are often semantically or visually intuitive, music primarily interacts with humans through auditory perception, making its data representation inherently less intuitive. Therefore, this paper first introduces the representations of music and provides an overview of music datasets. Subsequently, we categorize cross-modal interactions between music and multimodal data into three types: music-driven cross-modal interactions, music-oriented cross-modal interactions, and bidirectional music cross-modal interactions. For each category, we systematically trace the development of relevant sub-tasks, analyze existing limitations, and discuss emerging trends. Furthermore, we provide a comprehensive summary of datasets and evaluation metrics used in multimodal tasks related to music, offering benchmark references for future research. Finally, we discuss the current challenges in cross-modal interactions involving music and propose potential directions for future research.

Paper Structure

This paper contains 50 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: An overview of cross-modal interaction between music and multimodal data.
  • Figure 2: Examples of audio music representations.
  • Figure 3: Examples of symbolic music representations.
  • Figure 4: An overview of the music caption methods.
  • Figure 5: An overview of the music-driven matched video methods.
  • ...and 2 more figures