Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin

TL;DR

The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module that imitates human subjective evaluation scores. It also supports multi-format inputs and adaptable data processing workflows for various SVS models.

Abstract

This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from self-supervised learning (SSL) models and audio codecs. The toolkit offers significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. It features automatic music score error detection and correction, as well as a perception auto-evaluation module that imitates human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.

Paper Structure (7 sections, 2 figures)

This paper contains 7 sections, 2 figures.

Figures (2)

  • Figure 1: Improvements of Muskits-ESPnet compared with its original version. Boldface indicates new functions.
  • Figure 2: (a) SVS workflow. The upper section illustrates the training pipeline, while the lower section shows the inference pipeline for users. The functions in the yellow blocks can be flexibly selected based on specific requirements. (b) Different SVS feature representations. Continuous features include mel spectrograms and hidden embeddings from SSL models. Discrete representations consist of semantic tokens clustered from SSL models and acoustic tokens extracted from codecs.
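
To make the "semantic tokens clustered from SSL models" idea in Figure 2(b) concrete, the following is a minimal sketch and not the Muskits-ESPnet implementation. It assumes frame-level embeddings are already available (here they are synthetic 4-dimensional vectors standing in for SSL hidden states), fits a small k-means codebook, and maps each frame to the index of its nearest centroid. The function names `kmeans`, `nearest_centroid`, and `tokenize` are hypothetical and chosen only for illustration.

```python
import math
import random


def nearest_centroid(frame, codebook):
    """Return the index of the codebook entry closest to `frame`
    under squared Euclidean distance."""
    best_idx, best_dist = 0, math.inf
    for idx, centroid in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(frame, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def kmeans(frames, k, iters=10, seed=0):
    """Plain Lloyd's k-means: returns a list of k centroids (the codebook)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct frames.
    codebook = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest_centroid(f, codebook)].append(f)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if no frame was assigned
                dim = len(bucket[0])
                codebook[i] = [
                    sum(f[d] for f in bucket) / len(bucket) for d in range(dim)
                ]
    return codebook


def tokenize(frames, codebook):
    """Map each continuous frame to a discrete token id."""
    return [nearest_centroid(f, codebook) for f in frames]


# Synthetic stand-in for SSL embeddings (an assumption for this sketch):
# 20 frames near 0.0 followed by 20 frames near 5.0.
data_rng = random.Random(1)
frames = [
    [data_rng.gauss(center, 0.1) for _ in range(4)]
    for center in (0.0, 5.0)
    for _ in range(20)
]

codebook = kmeans(frames, k=2)
tokens = tokenize(frames, codebook)
```

In a real pipeline the frames would come from a pretrained SSL model and the codebook would typically be much larger; the sketch only shows how continuous frame-level features become a discrete token sequence.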