BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights

Chan-Jan Hsu, Yi-Cheng Lin, Chia-Chun Lin, Wei-Chih Chen, Ho Lam Chung, Chen-An Li, Yi-Chang Chen, Chien-Yu Yu, Ming-Ji Lee, Chien-Cheng Chen, Ru-Heng Huang, Hung-yi Lee, Da-Shan Shiu

TL;DR

BreezyVoice is a TTS system for Taiwanese Mandarin that targets the language's difficult polyphone disambiguation. It extends CosyVoice with a four-component pipeline: a Supervised Semantic Speech ($S^{3}$) tokenizer, a large language model (LLM) for text-to-unit generation, an optimal-transport conditional flow matching (OT-CFM) model, and a g2pW phoneme predictor. At inference, speech tokens are mapped to Mel spectrograms via $X_{\text{output}} = \mathrm{CFM}(v, U_{\text{cond}}, U_{\text{output}}, X_{\text{cond}})$, with the input text optionally augmented as $Y_{\text{augmented}} = \mathrm{g2pW}(Y)$. Empirical results show that BreezyVoice surpasses commercial TTS systems in both general and code-switching contexts, and that Iconic Unit Augmented Speech Cloning reduces phoneme errors while preserving most of the speaker similarity. The work also offers practical insights into neural codec TTS systems, including inference strategies and ethical considerations for deploying voice cloning.
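The data flow described above can be sketched in a few lines of Python. This is only an illustration of how the two formulas compose, not the BreezyVoice or CosyVoice API: the `Prompt` class, the `synthesize` function, the call signature of the LLM, and all `toy_*` stand-ins are hypothetical. The reading of the symbols ($v$ as a speaker embedding, $U_{\text{cond}}$ and $X_{\text{cond}}$ as the speech tokens and Mel spectrogram of a reference utterance) is likewise an assumption, since the summary does not define them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Prompt:
    """Reference utterance used for conditioning (hypothetical container)."""
    text: str                 # transcript of the reference audio
    units: List[int]          # U_cond: speech tokens of the reference audio (assumed)
    mel: List[List[float]]    # X_cond: Mel frames of the reference audio (assumed)
    speaker: List[float]      # v: speaker embedding (assumed)


def synthesize(
    text: str,
    prompt: Prompt,
    lm: Callable[[str, str, List[int]], List[int]],
    cfm: Callable[[List[float], List[int], List[int], List[List[float]]], List[List[float]]],
    g2p: Optional[Callable[[str], str]] = None,
) -> List[List[float]]:
    # Optional step: Y_augmented = g2pW(Y)
    y = g2p(text) if g2p is not None else text
    # Text-to-unit generation; this argument order is an assumption.
    u_output = lm(prompt.text, y, prompt.units)
    # X_output = CFM(v, U_cond, U_output, X_cond)
    return cfm(prompt.speaker, prompt.units, u_output, prompt.mel)


# Toy stand-ins so the sketch runs end to end; they do no real modelling.
def toy_g2p(text: str) -> str:
    # Annotate one polyphonic character with a Mandarin phonetic symbol.
    # The bracket format is illustrative, not the format used by the authors.
    return text.replace("行", "行[ㄏㄤˊ]")


def toy_lm(prompt_text: str, text: str, prompt_units: List[int]) -> List[int]:
    # One fake speech unit per input character.
    return [ord(ch) % 100 for ch in text]


def toy_cfm(v, u_cond, u_output, x_cond) -> List[List[float]]:
    # One fake single-bin Mel frame per generated unit.
    return [[float(u)] for u in u_output]


prompt = Prompt(text="你好", units=[1, 2], mel=[[0.0], [0.0]], speaker=[0.1])
plain = synthesize("銀行", prompt, toy_lm, toy_cfm)
augmented = synthesize("銀行", prompt, toy_lm, toy_cfm, g2p=toy_g2p)
print(len(plain), len(augmented))
```

Passing the phoneme predictor as an optional argument mirrors the "optional" wording of the summary: the same call path works with or without phonetic augmentation of the input text.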

Abstract

We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.

Paper Structure

This paper contains 31 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The augmentation pipeline for integrating Mandarin phonetic symbols employs a decision tree at both the sentence and character levels to diversify the inputs. Additionally, we partially add noise to some of the inputs to prioritize the model's attentiveness to the phonetic symbols.
  • Figure 2: Human preference evaluation comparing BreezyVoice with four competing systems across 30 comparisons on the TCMD dataset. The results demonstrate BreezyVoice's consistently superior performance.
  • Figure 3: Voice cloning Phoneme Error Rate (PER) of individual speakers from FormosaSpeech and Spontaneous Speech datasets.
  • Figure 4: Sensitivity of speaker similarity to phoneme error rate reduction after applying Iconic Unit Augmented Speech Cloning, compared with the standard speech cloning pipeline.
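
The Figure 1 caption describes a two-level decision process (sentence level, then character level) plus partial noising of the inputs. A minimal sketch of one way such an augmentation could work is given below. Everything specific here is an assumption made for illustration: the function name `augment`, the probabilities `p_sentence`, `p_char`, and `p_noise`, the bracket annotation format, and the choice to inject noise by swapping the character for a placeholder. The caption does not specify any of these details.

```python
import random
from typing import Dict, Optional


def augment(
    text: str,
    phonemes: Dict[int, str],
    p_sentence: float = 0.5,
    p_char: float = 0.5,
    p_noise: float = 0.1,
    noise_char: str = "？",
    rng: Optional[random.Random] = None,
) -> str:
    """Randomly attach phonetic symbols to characters of `text`.

    `phonemes` maps a character index to its phonetic symbol string.
    All probabilities and the noise scheme are hypothetical.
    """
    rng = rng or random.Random()

    # Sentence-level decision: leave some sentences completely untouched.
    if rng.random() >= p_sentence:
        return text

    out = []
    for i, ch in enumerate(text):
        symbol = phonemes.get(i)
        # Character-level decision: annotate only some of the characters.
        if symbol is None or rng.random() >= p_char:
            out.append(ch)
            continue
        # Noise: occasionally hide the character itself, so that only the
        # phonetic symbol carries the pronunciation.
        shown = noise_char if rng.random() < p_noise else ch
        out.append(f"{shown}[{symbol}]")
    return "".join(out)


symbols = {0: "ㄧㄣˊ", 1: "ㄏㄤˊ"}
print(augment("銀行", symbols, rng=random.Random(0)))
```

Setting `p_sentence=1.0, p_char=1.0, p_noise=0.0` annotates every listed character, while `p_sentence=0.0` returns the text unchanged, which makes the two branches easy to check in isolation.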