FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion
Alef Iury Siqueira Ferreira, Lucas Rafael Gris, Augusto Seben da Rosa, Frederico Santos de Oliveira, Edresson Casanova, Rafael Teixeira Sousa, Arnaldo Candido Junior, Anderson da Silva Soares, Arlindo Galvão Filho
TL;DR
The paper tackles zero-shot multilingual singing voice conversion by introducing FreeSVC, a VITS-based framework that combines a SPIN content extractor, an ECAPA2 speaker encoder, and trainable language embeddings to enable cross-language voice transfer without extensive language-specific training. It uses RMVPE for robust pitch estimation and trains end-to-end with GAN-based objectives on a diverse multilingual dataset. The model is evaluated with objective metrics (WER/CER, F0 correlation) and subjective MOS ratings. Ablation studies show that a multilingual content extractor is critical, and that combining SPIN with language conditioning benefits cross-lingual singing synthesis. The work reports improvements over the baseline, releases code and models publicly, and points to future work on speaker modeling for singing data and on refining the content extractor to further improve cross-language performance.
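The objective metrics named above can be illustrated with a minimal sketch. It assumes that WER and CER are edit-distance ratios over words and characters, and that the F0 correlation is a Pearson correlation computed over frames that are voiced in both contours. The paper's exact evaluation protocol (ASR model, text normalization, frame alignment) is not described in this summary, and all function names below are hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate. Assumption: spaces are ignored."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def f0_pearson(f0_source, f0_converted):
    """Pearson correlation between two frame-aligned F0 contours (Hz).

    Assumption: a value of 0 marks an unvoiced frame, and only frames
    voiced in both contours are compared.
    """
    pairs = [(a, b) for a, b in zip(f0_source, f0_converted) if a > 0 and b > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both contours")
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_x * var_y)
```

For example, `wer("the cat sat", "the dog sat")` counts one substitution against three reference words. In a conversion setting, the hypothesis would be an ASR transcript of the converted audio, and the two F0 contours would come from a pitch estimator such as RMVPE applied to the source and converted signals.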
Abstract
This work presents FreeSVC, a promising multilingual singing voice conversion approach that leverages an enhanced VITS model with Speaker-invariant Clustering (SPIN) for better content representation and the State-of-the-Art (SOTA) speaker encoder ECAPA2. FreeSVC incorporates trainable language embeddings to handle multiple languages and employs an advanced speaker encoder to disentangle speaker characteristics from linguistic content. Designed for zero-shot learning, FreeSVC enables cross-lingual singing voice conversion without extensive language-specific training. We demonstrate that a multilingual content extractor is crucial for optimal cross-language conversion. Our source code and models are publicly available.
