CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation

Jionghao Han, Jiatong Shi, Zhuoyan Tao, Yuxun Tang, Yiwen Zhao, Gus Xia, Shinji Watanabe

TL;DR

This work formalizes Non-Human Singing Generation (NHSG) through two tasks, non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), and introduces CartoonSing, a two-stage framework that unifies human and non-human timbres in singing. It uses a cross-domain frame-level representation $Z=(C,F)$, in which multi-layer content tokens $C$ and the F0 contour $F$ drive a timbre-aware vocoder that can condition on arbitrary non-human timbres, enabling zero-shot synthesis and conversion. The approach demonstrates strong timbre transfer to non-human domains while preserving intelligibility and musical structure, and domain-specific finetuning further improves stability and pitch accuracy. Because training is non-parallel, scales across diverse timbres, and requires no explicit vocal-score alignment for non-human audio, CartoonSing opens new avenues for creative audio synthesis in games, film, and virtual characters.

Abstract

Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which are increasingly demanded in creative applications such as video games, movies, and virtual characters. We introduce Non-Human Singing Generation (NHSG), covering non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), as a novel machine learning task for generating musically coherent singing with non-human timbral characteristics. NHSG is particularly challenging due to the scarcity of non-human singing data, the lack of symbolic alignment, and the wide timbral gap between human and non-human voices. To address these challenges, we propose CartoonSing, a unified framework that integrates singing voice synthesis and conversion while bridging human and non-human singing generation. CartoonSing employs a two-stage pipeline: a score representation encoder trained with annotated human singing and a timbre-aware vocoder that reconstructs waveforms for both human and non-human audio. Experiments demonstrate that CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS and SVC toward creative, non-human singing generation.

Paper Structure

This paper contains 40 sections, 16 equations, 3 figures, and 15 tables.

Figures (3)

  • Figure 1: Comparison of task formulations for conventional singing voice synthesis (SVS) and conversion (SVC) versus non-human singing voice synthesis (NHSVS) and conversion (NHSVC).
  • Figure 2: An overview of the proposed two-stage synthesis pipeline. Stage 1 trains a score representation encoder $g_\theta$ on annotated human singing data. Stage 2 trains a unified timbre-aware vocoder $h_\phi$ on both human and non-human audio.
  • Figure 3: The inference flow of CartoonSing, demonstrating its application in (a) Non-Human Singing Voice Synthesis (NHSVS) from a musical score, and (b) Non-Human Singing Voice Conversion (NHSVC) from a source audio.
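The data flow described in Figures 2 and 3 can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' implementation: `toy_score_encoder` stands in for the score representation encoder $g_\theta$, `toy_vocoder` stands in for the timbre-aware vocoder $h_\phi$, and the `FrameRepr` class, the note tuple format, and the use of harmonic amplitudes as a "timbre" vector are all hypothetical choices made for illustration. The point it shows is that NHSVS and NHSVC share the same second stage and differ only in where $Z=(C,F)$ comes from.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FrameRepr:
    """Frame-level representation Z = (C, F). Hypothetical container."""
    content: List[List[int]]  # C: one list of content tokens (one per layer) per frame
    f0: List[float]           # F: one F0 value in Hz per frame

    def __post_init__(self) -> None:
        if len(self.content) != len(self.f0):
            raise ValueError("C and F must cover the same number of frames")


def toy_score_encoder(
    notes: Sequence[Tuple[int, int, int]],
    frames_per_beat: int = 4,
    n_layers: int = 2,
) -> FrameRepr:
    """Stand-in for g_theta (assumption): expand a score into frame-level Z.

    Each note is (midi_pitch, duration_in_beats, syllable_id).
    A real encoder would be learned from annotated human singing.
    """
    content: List[List[int]] = []
    f0: List[float] = []
    for midi_pitch, beats, syllable_id in notes:
        hz = 440.0 * 2.0 ** ((midi_pitch - 69) / 12)  # MIDI note number to Hz
        for _ in range(beats * frames_per_beat):
            content.append([syllable_id] * n_layers)  # toy multi-layer tokens
            f0.append(hz)
    return FrameRepr(content, f0)


def toy_vocoder(
    z: FrameRepr,
    timbre: Sequence[float],
    sample_rate: int = 16000,
    hop: int = 160,
) -> List[float]:
    """Stand-in for h_phi (assumption): render Z with a given timbre.

    Here "timbre" is just a list of harmonic amplitudes and the content
    tokens are ignored; a real vocoder would condition on both.
    """
    wave: List[float] = []
    phase = 0.0
    for hz in z.f0:
        for _ in range(hop):
            phase += 2.0 * math.pi * hz / sample_rate
            wave.append(sum(a * math.sin((k + 1) * phase)
                            for k, a in enumerate(timbre)))
    return wave


# (a) NHSVS: Z is predicted from a musical score, then rendered.
z_from_score = toy_score_encoder([(69, 1, 7), (72, 2, 9)])
svs_wave = toy_vocoder(z_from_score, timbre=[1.0, 0.5, 0.25])

# (b) NHSVC: Z would be extracted from source audio (extractor omitted here),
# then rendered by the same vocoder with a different target timbre.
z_from_audio = FrameRepr(content=[[3, 3]] * 5, f0=[220.0] * 5)
svc_wave = toy_vocoder(z_from_audio, timbre=[0.2, 1.0])
```

Keeping timbre out of $Z$ is what lets the second stage be trained on unannotated non-human audio in this sketch: only the first stage needs score annotations.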