DNN-based ensemble singing voice synthesis with interactions between singers

Hiroaki Hyodo, Shinnosuke Takamichi, Tomohiko Nakamura, Junya Koguchi, Hiroshi Saruwatari

TL;DR

A singing voice synthesis method that produces a more unified ensemble singing voice by modeling interactions between singers. It is based on an architecture that uses the musical scores of multiple voice parts and on loss functions that simulate the effect of these interactions on acoustic features.

Abstract

We propose a singing voice synthesis (SVS) method that produces a more unified ensemble singing voice by modeling interactions between singers. Most existing SVS methods aim to synthesize a solo voice and do not consider interactions between singers, i.e., each singer adjusting their own voice to the others' voices. Because producing ensemble voices from separately synthesized solo voices ignores these interactions, it can degrade the unity of the vocal ensemble. Therefore, we propose an SVS method that reproduces the interactions. It is based on an architecture that uses the musical scores of multiple voice parts and on loss functions that simulate the effect of the interactions on acoustic features. Experimental results show that our methods improve the unity of the vocal ensemble.

Paper Structure (20 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Concept of the proposed ensemble SVS approach, which explicitly models interactions between singers. Human singers adjust their voices by listening to the others' voices. Conventional SVS methods synthesize the singing voice of each voice part separately. In contrast, the proposed method produces ensemble singing voices using information from the other voice parts, just as human singers interact.
  • Figure 2: Two padding methods for score features. Numbers indicate note indices.
  • Figure 3: Network architecture and loss functions of the proposed method. For brevity, singer embeddings and conventional loss functions are omitted.
  • Figure 4: Ground-truth (GT) acoustic features and those predicted by Baseline and the proposed MT+F0diff+Powdiff. Vo and S denote the lead vocal and soprano voice parts, respectively. The features predicted by the proposed method change synchronously across voice parts, particularly at the times marked with red stars.