A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons
Tzu-Yun Hung, Jui-Te Wu, Yu-Chia Kuo, Yo-Wei Hsiao, Ting-Wei Lin, Li Su
TL;DR
The study tackles expressiveness in violin EMS under EMT conditioning, comparing a parameter-controlled Gaussian-DDSP framework (with MIDI and MusicXML variants) against an end-to-end StyleSpeech-based approach. Using the SCREAM-MAC-EMT dataset, it performs EMT classification and a human-in-the-loop subjective evaluation to assess model performance across selected EMTs. The findings indicate that no model consistently outperforms the others: parameter-controlled models tend to align with human performances when authenticity is valued, while the end-to-end model can rival or surpass them in cases where EMT meanings are ambiguous. Inference efficiency also differs, with the end-to-end system delivering faster results, highlighting a trade-off between realism, controllability, and practicality in EMS for violin performance.
Abstract
Expressive music synthesis (EMS) for violin performance is a challenging task due to disagreement among performers on the interpretation of expressive musical terms (EMTs), the scarcity of labeled recordings, and the limited generalization ability of synthesis models. These challenges create trade-offs between model effectiveness, diversity of generated results, and controllability of the synthesis system, making a comparative study of EMS model design essential. This paper explores two violin EMS approaches. The end-to-end approach is a modification of a state-of-the-art text-to-speech generator. The parameter-controlled approach is based on a simple parameter sampling process that can render note lengths and other parameters compatible with MIDI-DDSP. We study these two approaches (three model variants in total) through objective and subjective experiments and discuss several key issues of EMS based on the results.
