SingIt! Singer Voice Transformation

Amit Eliav, Aaron Taub, Renana Opochinsky, Sharon Gannot

TL;DR

The paper tackles speech-to-singing conversion: given a sample of a user’s ordinary speech, it uses zero-shot style transfer to render any song in that speaker’s voice, without requiring parallel singing data. It builds a modular autoencoder-based system that combines vocal separation (Spleeter), a 256-d speaker embedding (Resemblyzer), a content encoder, and a decoder with a Postnet. The system operates on log-spectrograms and uses Griffin-Lim vocoding to produce the final waveform. A new Songlist dataset is introduced to broaden content and style variation. In a subjective evaluation with 25 listeners, the generated singing aligned more closely with the target style while preserving content and melody, though liveliness and distinctiveness remain areas for improvement. The work demonstrates a practical pathway for applying speech-driven singing synthesis in media and entertainment, and leaves multi-singer robustness and expressive realism for further refinement.
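To make the data flow in the summary above more concrete, here is a minimal sketch of one common way a decoder can be conditioned on a speaker embedding: the same utterance-level embedding is appended to every frame of the content code. This is an assumption for illustration only; the summary does not say how the paper's decoder combines the two inputs. The function name `condition_on_speaker`, the 32-d content code, and the 4-frame length are all hypothetical. Only the 256-d embedding size comes from the text.

```python
from typing import List

EMBED_DIM = 256  # speaker-embedding size stated in the summary

Frame = List[float]


def condition_on_speaker(content: List[Frame], speaker: Frame) -> List[Frame]:
    """Append one utterance-level speaker embedding to every content frame.

    Assumed conditioning scheme (not confirmed by the source): plain
    concatenation along the feature axis, repeated over time.
    """
    if len(speaker) != EMBED_DIM:
        raise ValueError(
            f"expected a {EMBED_DIM}-d speaker embedding, got {len(speaker)}"
        )
    return [frame + speaker for frame in content]


# Hypothetical sizes: 4 time frames, 32-d content code per frame.
content_codes = [[0.0] * 32 for _ in range(4)]
# Stand-in for an embedding computed from the user's speech.
target_speaker = [0.1] * EMBED_DIM

decoder_input = condition_on_speaker(content_codes, target_speaker)
print(len(decoder_input), len(decoder_input[0]))  # prints "4 288"
```

Under this assumed scheme, the zero-shot step amounts to swapping `target_speaker` for the embedding of a speaker not seen in training while leaving the content codes of the song unchanged.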

Abstract

In this paper, we propose a model that can generate a singing voice from a normal speech utterance by harnessing zero-shot, many-to-many style transfer learning. Our goal is to give anyone the opportunity to sing any song in a timely manner. We present a system comprising several available blocks, as well as a modified auto-encoder, and show how this highly complex challenge can be met by combining rather simple solutions. We demonstrate the applicability of the proposed system using a group of 25 non-expert listeners. Samples of the data generated by our model are provided.

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: High-level system overview.
  • Figure 2: Detailed Solution Architecture.