MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing

Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su

TL;DR

MOSA addresses the challenge of cross-modal music processing by providing a large-scale dataset that pairs high-quality 3-D motion capture with audio and detailed note-level semantics. It introduces a two-stage synchronization pipeline and demonstrates cross-modal tasks in time semantics, expressive semantics, and audio-to-motion generation using CNN/Transformer architectures. The work delivers extensive datasets, annotation schemes, and evaluation results that underscore MOSA's utility for MIR, cross-modal generation, and animation, with practical potential in automatic video and music-video generation. This dataset and framework pave the way for more accurate cross-modal mappings between motion, sound, and music meaning, enabling richer and more controllable music-driven content creation.

Abstract

In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high-quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamics, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrases, and expressive content from audio, video, and motion data, and the generation of musicians' body motion from given music audio. The dataset and code are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset).

Paper Structure (32 sections, 10 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: The construction of the MOSA dataset (upper), and the A2S, M2S, and A2M modules for transferring between different modalities (lower).
  • Figure 2: Left panel: the collected data (audio, 3-D motion) and semantic annotations of the MOSA dataset. Right panel: the statistics of semantic annotations (note, beat, downbeat, phrase, dynamics, articulation, harmony) in the MOSA dataset.
  • Figure 3: The laboratory setting for 3-D motion capture and audio recordings.
  • Figure 4: Hand joint positions extracted from video.
  • Figure 5: Experimental results of time and expressive semantics.
  • ...and 2 more figures