A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons

Tzu-Yun Hung, Jui-Te Wu, Yu-Chia Kuo, Yo-Wei Hsiao, Ting-Wei Lin, Li Su

TL;DR

The study tackles expressiveness in violin EMS under EMT conditioning, comparing a parameter-controlled Gaussian-DDSP framework (MIDI and MusicXML variants) with an end-to-end StyleSpeech-based approach. Using the SCREAM-MAC-EMT dataset, it performs EMT classification and a human-in-the-loop subjective evaluation to assess model performance across selected EMTs. Findings indicate that no model consistently outperforms the others: the parameter-controlled models tend to align with human performances when authenticity is valued, while the end-to-end model can rival or surpass them in cases where EMT meanings are ambiguous. Inference efficiency also differs, with the end-to-end system delivering faster results, highlighting a trade-off between realism, controllability, and practicality in EMS for violin performance.

Abstract

Expressive music synthesis (EMS) for violin performance is a challenging task due to the disagreement among music performers in the interpretation of expressive musical terms (EMTs), scarcity of labeled recordings, and limited generalization ability of the synthesis model. These challenges create trade-offs between model effectiveness, diversity of generated results, and controllability of the synthesis system, making it essential to conduct a comparative study on EMS model design. This paper explores two violin EMS approaches. The end-to-end approach is a modification of a state-of-the-art text-to-speech generator. The parameter-controlled approach is based on a simple parameter sampling process that can render note lengths and other parameters compatible with MIDI-DDSP. We study these two approaches (in total, three model variants) through objective and subjective experiments and discuss several key issues of EMS based on the results.

Paper Structure

This paper contains 18 sections, 2 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Diagram of the end-to-end and the parameter-controlled EMS models. The training data are illustrated in blue, and the bold blue arrows represent the actions in the training process. Italic text represents the required inputs in the synthesis process. The two variants of the parameter-controlled model are illustrated as the solid-line and dashed-line arrows in the lower-left corner, where one represents MIDI input and the other MusicXML input (i.e., adding articulation).
  • Figure 2: Confusion matrix of EMT classification.