SyncViolinist: Music-Oriented Violin Motion Generation Based on Bowing and Fingering

Hiroki Nishizawa, Keitaro Tanaka, Asuka Hirata, Shugo Yamaguchi, Qi Feng, Masatoshi Hamanaka, Shigeo Morishima

TL;DR

This paper tackles automatic violin motion generation from audio by introducing SyncViolinist, a two-stage framework that first infers bowing and fingering from audio and then generates synchronized, natural motion via multiple BiLSTM branches conditioned on those inferences. A new professionally recorded violin motion dataset with synchronized audio, motion capture, and precise bowing/fingering annotations supports end-to-end training and evaluation, with post-processing to produce render-ready poses. Empirical results show significant improvements over state-of-the-art baselines in both objective metrics (L1, DTW, jerk) and subjective realism validated by professional violinists, including generalization to unseen performers. The work offers a practical pathway to realistic, audio-driven violin animation and provides a rich dataset for future research, with potential extension to other string instruments and applications in digital media workflows.

Abstract

Automatically generating realistic musical performance motion can greatly enhance digital media production, often involving collaboration between professionals and musicians. However, capturing the intricate body, hand, and finger movements required for accurate musical performances is challenging. Existing methods often fall short due to the complex mapping between audio and motion, typically requiring additional inputs like scores or MIDI data. In this work, we present SyncViolinist, a multi-stage end-to-end framework that generates synchronized violin performance motion solely from audio input. Our method overcomes the challenge of capturing both global and fine-grained performance features through two key modules: a bowing/fingering module and a motion generation module. The bowing/fingering module extracts detailed playing information from the audio, which the motion generation module uses to create precise, coordinated body motions reflecting the temporal granularity and nature of the violin performance. We demonstrate the effectiveness of SyncViolinist with significantly improved qualitative and quantitative results from unseen violin performance audio, outperforming state-of-the-art methods. Extensive subjective evaluations involving professional violinists further validate our approach. The code and dataset are available at https://github.com/Kakanat/SyncViolinist.
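The TL;DR names three objective metrics (L1, DTW, jerk) without defining them. As a rough illustration only, the sketch below computes generic textbook versions of these quantities on motion stored as a list of per-frame coordinate lists. The function names, the data layout, and the exact normalizations are assumptions made for this example; the paper's own definitions may differ.

```python
import math


def l1_distance(pred, target):
    """Mean absolute error over all frames and coordinates (assumed definition)."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


def mean_jerk(motion, fps):
    """Mean magnitude of the third finite difference of position.

    Jerk is the third time derivative of position; here it is approximated
    by a forward difference over four consecutive frames.
    """
    dt = 1.0 / fps
    dims = len(motion[0])
    magnitudes = []
    for i in range(len(motion) - 3):
        jerk = [
            (motion[i + 3][d] - 3 * motion[i + 2][d]
             + 3 * motion[i + 1][d] - motion[i][d]) / dt ** 3
            for d in range(dims)
        ]
        magnitudes.append(math.sqrt(sum(j * j for j in jerk)))
    return sum(magnitudes) / len(magnitudes)


def dtw_distance(a, b):
    """Classic dynamic time warping cost with Euclidean per-frame distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

In this reading, L1 measures per-frame positional error, DTW tolerates small timing offsets between generated and reference motion, and a lower jerk indicates smoother movement.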

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: SyncViolinist can automatically generate synchronized violin performance motions entirely from the audio input, accurately reflecting global and fine-grained performance features such as natural body movements and coordinated bowing and fingering.
  • Figure 2: Proposed method overview. The framework has two components: a bowing/fingering module and a motion generation module.
  • Figure 3: Statistics of the proposed dataset.
  • Figure 4: Subjective evaluation of motion naturalness between marker- and markerless-based datasets.
  • Figure 5: Samples of generated results and ground truth. More results are available in the supplementary material.
  • ...and 2 more figures