DNN-based ensemble singing voice synthesis with interactions between singers

Hiroaki Hyodo; Shinnosuke Takamichi; Tomohiko Nakamura; Junya Koguchi; Hiroshi Saruwatari

TL;DR

A singing voice synthesis method that produces a more unified ensemble singing voice by modeling interactions between singers. It is based on an architecture that uses the musical scores of multiple voice parts and on loss functions that simulate the effect of these interactions on acoustic features.

Abstract

We propose a singing voice synthesis (SVS) method that produces a more unified ensemble singing voice by modeling interactions between singers. Most existing SVS methods aim to synthesize a solo voice and do not consider interactions between singers, i.e., each singer adjusting their own voice to the others' voices. Because producing an ensemble voice from separately synthesized solo voices ignores these interactions, it can degrade the unity of the vocal ensemble. We therefore propose an SVS method that reproduces the interactions. It is based on an architecture that uses the musical scores of multiple voice parts and on loss functions that simulate the effect of the interactions on acoustic features. Experimental results show that the proposed method improves the unity of the vocal ensemble.
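To make the idea of a loss that reflects interactions more concrete, here is a minimal sketch of one plausible reading: penalize errors in the frame-wise difference between two voice parts rather than in each part alone. The abstract does not give the actual formulation, so the function name `interpart_difference_loss`, the squared-error form, and the toy log-F0 values are all assumptions made for illustration, not the paper's definition.

```python
def interpart_difference_loss(pred_a, pred_b, gt_a, gt_b):
    """Mean squared error between predicted and ground-truth inter-part differences.

    Hypothetical illustration only: each argument is a list of frame-wise
    acoustic feature values (e.g. log-F0) for one voice part.
    """
    n = len(pred_a)
    if n == 0 or not (n == len(pred_b) == len(gt_a) == len(gt_b)):
        raise ValueError("all contours must be non-empty and equally long")
    total = 0.0
    for pa, pb, ga, gb in zip(pred_a, pred_b, gt_a, gt_b):
        # Compare how far apart the two parts are, not their absolute values.
        total += ((pa - pb) - (ga - gb)) ** 2
    return total / n


# Toy ground-truth contours for two voice parts (assumed values).
gt_lead = [5.00, 5.02, 5.04, 5.02]
gt_soprano = [x + 0.30 for x in gt_lead]

# Both parts drift together: the inter-part difference is preserved.
loss_together = interpart_difference_loss(
    [x + 0.05 for x in gt_lead], [x + 0.05 for x in gt_soprano],
    gt_lead, gt_soprano)

# Only one part drifts: the inter-part difference changes.
loss_apart = interpart_difference_loss(
    [x + 0.05 for x in gt_lead], gt_soprano, gt_lead, gt_soprano)
```

Under this assumed form, a prediction in which both parts deviate in the same way is barely penalized, while a deviation in only one part is. Such a term would presumably be combined with conventional per-part losses, which anchor the absolute values.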

Paper Structure (20 sections, 3 equations, 4 figures, 4 tables)

Introduction
Related works
Interaction among singers
Evaluation metrics of unity of vocal ensemble
Singing voice synthesis
Chorus synthesis based on signal processing
Method
Data preprocessing
Architecture
Loss functions
Inference
Extensibility to more than two voice parts
Experimental evaluation
Experimental conditions
Acoustic model: LSTM vs. diffusion
...and 5 more sections

Figures (4)

Figure 1: Concept of the proposed ensemble SVS approach, which explicitly models interactions between singers. Human singers adjust their voices by listening to the others' voices. Conventional SVS methods synthesize the singing voice of each voice part separately. In contrast, the proposed method produces ensemble singing voices using information from the other voice parts, as interacting human singers do.
Figure 2: Two padding methods of score features. Numbers indicate note indices.
Figure 3: Network architecture and loss functions of proposed method. For brevity, singer-embeddings and conventional loss functions are omitted.
Figure 4: Ground-truth (GT) acoustic features and those predicted by Baseline and the proposed MT+F0diff+Powdiff. Vo and S denote the lead vocal and soprano voice parts, respectively. The features predicted by the proposed method change synchronously across voice parts, particularly at the times marked with red stars.
