VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation

Ting-Kang Wang, Yueh-Po Peng, Li Su, Vincent K. M. Cheung

TL;DR

VioPTT is a violin-focused transcription framework that jointly predicts pitch, onset, offset, and playing technique, addressing a gap in automatic music transcription (AMT) around instrument-specific expressivity. It combines a high-resolution transcription backbone with an articulation module, trained with pitch/timing augmentation and with MOSA-VPT, a novel synthetic technique dataset derived from MIDI that enables annotation-free, scalable supervision. The system achieves state-of-the-art or competitive performance on real violin datasets, and its technique classification generalizes robustly from synthetic to real audio, supporting the use of synthetic data to capture expressive nuances. This approach advances AMT toward richer musical representation, with potential applications in synthesis, performance analysis, and pedagogy.

Abstract

While automatic music transcription is well-established in music information retrieval, most models are limited to transcribing pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords its distinct palette of timbres for maximal emotional impact. Here, we propose VioPTT (Violin Playing Technique-aware Transcription), a lightweight, end-to-end model that directly transcribes violin playing technique in addition to pitch, onset, and offset. Furthermore, we release MOSA-VPT, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrated strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to combine violin transcription and playing technique prediction within a unified framework.

Paper Structure

This paper contains 15 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our technique-aware violin transcription model.
  • Figure 2: Confusion matrix for classification across four violin playing techniques using all transcribed features. Predictions were aggregated across all folds to highlight overall class-wise error patterns.
  • Figure 3: UMAP visualization of real-life test data on learned synthetic note-level embeddings for four violin playing techniques (flageolet, détaché, pizzicato, spiccato).