Language Model Mapping in Multimodal Music Learning: A Grand Challenge Proposal

Daniel Chin, Gus Xia

TL;DR

The paper tackles cross-modal knowledge transfer across music modalities by introducing Language Model Mapping (LMM) in Multimodal Music Learning, formalizing a basic setup with abundant score and audio data, a differentiable piano synthesizer, and limited score–audio pairs to learn mappings $g: \,X^{\rm V}\to X^{\rm M}$, $h: \,X^{\rm A}\to X^{\rm M}$, and a motion-focused LM $\text{LM}^{\rm M}$. It argues that current multimodal LM work, which relies on text-grounded tokens, is insufficient to capture musicality and that humans leverage internal LM representations ($\text{LM}^{\rm V}$, $\text{LM}^{\rm A}$) in a closed loop to acquire skill, a principle LMM seeks to formalize. The paper situates LMM relative to existing music AI tasks (OMR, transcription) and demonstrates the need for cross-modal and cross-domain grounding, rather than simply translating modalities to a single backbone. It then outlines both standard and advanced problem versions, articulating potential benefits such as action learning from both symbols and sensory input, sample- and training-efficient learning, and the possibility of a pan-modality core LM (e.g., a hypernetwork). Overall, LMM is proposed as a framework to unify modalities around a shared musicality, enabling more human-like, data-efficient multimodal learning in music and beyond.

Abstract

We have seen remarkable success in representation learning and language models (LMs) using deep neural networks. Many studies aim to build the underlying connections among different modalities via alignment and mapping at the token or embedding level, but so far most methods are very data-hungry, which limits their performance in domains such as music, where paired data are less abundant. We argue that embedding alignment captures only the surface level of multimodal alignment. In this paper, we propose a grand challenge of language model mapping (LMM), i.e., how to map the essence implied in the LM of one domain to the LM of another domain, under the assumption that LMs of different modalities are tracking the same underlying phenomena. We first introduce a basic setup of LMM, highlighting the goal of unveiling a deeper aspect of cross-modal alignment as well as achieving more sample-efficient learning. We then discuss why music is an ideal domain in which to conduct LMM research. After that, we connect LMM in music with a more general and challenging scientific problem of learning to take actions based on both sensory input and abstract symbols, and in the end present an advanced version of the challenge problem setup.

Paper Structure

This paper contains 12 sections, 3 figures.

Figures (3)

  • Figure 1: (a) The three modalities in music. Vision (V) refers to music score images. Audio (A) refers to music audio. Motion (M) refers to instrument controls, i.e., detailed performance motions. Data in each modality can be modeled by a uni-modal LM. Arrows across modalities refer to time-aligned translation tasks, including OMR and transcription. (b) Illustration of the basic setup. The basic version of our grand challenge involves three given elements and three desired goals. Data are abundantly available in A and V, and the synthesis task is considered to be already well-solved by rule-based music synthesizers as virtual instruments. Given the above, we seek the three goals marked with unique colors: OMR, transcription, and an LM in M.
  • Figure 2: An example showing that LMM can be cross-domain. The within-domain LMM between music score and audio is contingent on different modalities sharing the same musicality. The cross-domain LMM between music audio and sculpture video is contingent on different domains of art sharing the same artistic nature.
  • Figure 3: The advanced version of Fig. 1(a), generalizing the LMM challenge in multimodal music learning. Vision (V) refers to music score images. Audio (A) refers to music audio. Motion (M) refers to instrument controls, i.e., detailed performance motions. Data in each modality can be modeled by a uni-modal LM. Arrows across modalities refer to time-aligned translation tasks, including OMR and transcription. The advanced version puts forward the question of how data in one modality may help the training of LM in other modalities in general.
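
The basic setup described in the TL;DR and in the Figure 1(b) caption can be written down compactly. The following is only a sketch under stated assumptions, not the authors' formulation: the mappings $g$ and $h$ and the motion LM $\text{LM}^{\rm M}$ come from the summary above, while the synthesizer symbol $s$, the dataset symbols $D^{\rm V}$, $D^{\rm A}$, $P$, the distance $d$, the weight $\lambda$, and the example objective itself are hypothetical and introduced here purely for illustration.

```latex
% Symbols s, d, \lambda, D^V, D^A, P and the objective are illustrative assumptions.
\begin{align*}
  &\text{Given: } D^{\rm V} \subset X^{\rm V},\quad D^{\rm A} \subset X^{\rm A},\quad
    s: X^{\rm M} \to X^{\rm A} \text{ (synthesizer)},\quad
    P = \{(x^{\rm V}_i, x^{\rm A}_i)\}_{i=1}^{n},\ n \ll |D^{\rm V}|, |D^{\rm A}| \\
  &\text{Sought: } g: X^{\rm V} \to X^{\rm M} \text{ (OMR)},\quad
    h: X^{\rm A} \to X^{\rm M} \text{ (transcription)},\quad \text{LM}^{\rm M} \\
  &\text{One hypothetical objective: }
    \min_{g,\,h}\ \mathbb{E}_{x^{\rm A} \sim D^{\rm A}}\, d\big(s(h(x^{\rm A})),\, x^{\rm A}\big)
    \;+\; \lambda \sum_{i=1}^{n} d\big(s(g(x^{\rm V}_i)),\, x^{\rm A}_i\big)
\end{align*}
```

In this reading, the first term uses the abundant unpaired audio by requiring that transcribed motion re-synthesize to the original audio, and the second term uses the limited score–audio pairs to tie the score-to-motion mapping to the same audio space. Other objectives are equally compatible with the setup as summarized here.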