OpenVoice: Versatile Instant Voice Cloning

Zengyi Qin, Wenliang Zhao, Xumin Yu, Xin Sun

TL;DR

OpenVoice tackles flexible voice style control and zero-shot cross-lingual voice cloning by decoupling tone color from language and style generation. It combines a base text-to-speech system that governs language and expressive style with a tone color converter that injects the reference speaker's tone color via an invertible flow, guided by language-neutral IPA representations. Training jointly optimizes naturalness through mel-spectrogram and HiFi-GAN losses while using a KL-divergence objective to strip tone color from content features and reintroduce it via the reference, enabling cross-lingual cloning even for unseen languages. The approach enables fast, non-autoregressive inference, requires only short reference clips, and is publicly released to accelerate research; it has seen broad real-world use and demonstrates strong qualitative results across languages and styles.

Abstract

We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are neither directly copied from nor constrained by the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require an extensive massive-speaker multi-lingual (MSML) dataset covering all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results on our demo website. OpenVoice has been used by more than 2M users worldwide as the voice engine of MyShell.ai.

Paper Structure (7 sections, 1 figure)


Figures (1)

  • Figure 1: Illustration of the OpenVoice framework. We use a base speaker model to control the styles and languages, and a converter to embody the tone color of the reference speaker into the speech.
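
The TL;DR describes a tone color converter that removes the source speaker's tone color through an invertible flow and reintroduces the reference speaker's tone color on the way back. The toy sketch below illustrates only that remove-then-reinject idea. It is not the authors' implementation: the per-dimension affine map, the function names (`make_affine_params`, `flow_forward`, `flow_inverse`, `convert`), and the way the tone color vector conditions the map are all assumptions made for illustration, whereas a real converter would use learned flow layers operating on spectrogram features.

```python
import math


def make_affine_params(tone_color, dim):
    """Hypothetical conditioning: derive a shift and a log-scale per
    feature dimension from a tone color vector (not a learned network)."""
    n = len(tone_color)
    shift = [tone_color[i % n] for i in range(dim)]
    log_scale = [0.1 * math.tanh(tone_color[(i + 1) % n]) for i in range(dim)]
    return shift, log_scale


def flow_forward(x, tone_color):
    """Map a feature frame x to a tone-color-neutral representation z."""
    shift, log_scale = make_affine_params(tone_color, len(x))
    return [(xi - b) * math.exp(-s) for xi, b, s in zip(x, shift, log_scale)]


def flow_inverse(z, tone_color):
    """Exact inverse of flow_forward for the same tone color vector."""
    shift, log_scale = make_affine_params(tone_color, len(z))
    return [zi * math.exp(s) + b for zi, b, s in zip(z, shift, log_scale)]


def convert(x, source_tone, reference_tone):
    """Strip the source tone color, then inject the reference tone color."""
    z = flow_forward(x, source_tone)
    return flow_inverse(z, reference_tone)


if __name__ == "__main__":
    frame = [0.5, -1.2, 2.0, 0.0]   # one made-up feature frame
    base_speaker = [0.3, -0.7]      # made-up tone color of the base speaker
    reference = [1.1, 0.4]          # made-up tone color of the reference speaker
    print(convert(frame, base_speaker, reference))
```

Because the map is invertible, converting to the reference tone color and then back to the base speaker recovers the original frame up to floating-point error, which is the property that lets the same flow be used in both directions.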