VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation
Ting-Kang Wang, Yueh-Po Peng, Li Su, Vincent K. M. Cheung
TL;DR
VioPTT introduces a violin-focused transcription framework that jointly predicts pitch, onset, offset, and playing technique, addressing a gap in automatic music transcription (AMT) for instrument-specific expressivity. It combines a high-resolution transcription backbone with an articulation module, and is trained with both pitch/timing augmentation and a novel synthetic technique dataset, MOSA-VPT, which is derived from MIDI and enables annotation-free, scalable supervision. The system achieves state-of-the-art or competitive performance on real violin datasets, and its technique classification generalizes robustly from synthetic to real audio, supporting the use of synthetic data to capture expressive nuances. This approach advances AMT toward richer musical representation and has potential applications in synthesis, performance analysis, and pedagogy.
Abstract
While automatic music transcription is well-established in music information retrieval, most models are limited to transcribing pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords the instrument its distinct palette of timbres for maximal emotional impact. Here, we propose VioPTT (Violin Playing Technique-aware Transcription), a lightweight, end-to-end model that directly transcribes violin playing technique in addition to pitch, onset, and offset. Furthermore, we release MOSA-VPT, a novel, high-quality synthetic violin playing technique dataset that circumvents the need for manual annotations. Leveraging this dataset, our model demonstrates strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to combine violin transcription and playing technique prediction within a unified framework.
