BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights
Chan-Jan Hsu, Yi-Cheng Lin, Chia-Chun Lin, Wei-Chih Chen, Ho Lam Chung, Chen-An Li, Yi-Chang Chen, Chien-Yu Yu, Ming-Ji Lee, Chien-Cheng Chen, Ru-Heng Huang, Hung-yi Lee, Da-Shan Shiu
TL;DR
BreezyVoice is a TTS system for Taiwanese Mandarin that targets the language's difficult polyphone disambiguation. It extends CosyVoice with a four-component pipeline: a Supervised Semantic Speech ($S^{3}$) tokenizer, a large language model (LLM) for text-to-unit generation, an optimal-transport conditional flow matching (OT-CFM) model, and a g2pW grapheme-to-phoneme predictor. At inference, the flow matching model maps speech tokens to Mel spectrograms via $X_{\text{output}} = \mathrm{CFM}(v, U_{\text{cond}}, U_{\text{output}}, X_{\text{cond}})$, and the input text $Y$ can optionally be augmented with predicted phonemes, $Y_{\text{augmented}} = \mathrm{g2pW}(Y)$. In the reported evaluation, BreezyVoice surpasses commercial TTS systems in general and code-switching contexts, and Iconic Unit Augmented Speech Cloning reduces phoneme errors while preserving most of the speaker similarity. The work also offers practical insights into neural codec TTS systems, including inference strategies and ethical considerations for deploying voice cloning.
Abstract
We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, with phonetic control abilities that address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching (OT-CFM) model, and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
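The data flow between the four components can be pictured with a minimal Python sketch. It is not the BreezyVoice or CosyVoice implementation: the class `TTSPipeline`, the `toy_*` functions, the bracketed annotation format, and the exact inputs given to the LLM are all assumptions made for illustration. Only the argument order of the flow matching call follows the formula $X_{output} = CFM(v, U_{cond}, U_{output}, X_{cond})$ from the TL;DR, and converting the resulting Mel frames to a waveform is left out.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[int]        # discrete speech units
Mel = List[List[float]]   # Mel frames, one inner list per frame


@dataclass
class TTSPipeline:
    """Hypothetical wiring of the four components; every field is a stand-in."""
    g2pw: Callable[[str], str]                               # Y -> Y_augmented
    tokenizer: Callable[[Mel], Tokens]                       # X_cond -> U_cond
    llm: Callable[[str, Tokens], Tokens]                     # text -> U_output (inputs assumed)
    cfm: Callable[[List[float], Tokens, Tokens, Mel], Mel]   # -> X_output

    def synthesize(self, y: str, v: List[float], x_cond: Mel,
                   augment: bool = True) -> Mel:
        if augment:
            y = self.g2pw(y)                  # optional: Y_augmented = g2pW(Y)
        u_cond = self.tokenizer(x_cond)       # units of the conditioning speech
        u_output = self.llm(y, u_cond)        # text-to-unit generation
        # X_output = CFM(v, U_cond, U_output, X_cond)
        return self.cfm(v, u_cond, u_output, x_cond)


# Toy stand-ins so the sketch runs; none of them models real behavior.
def toy_g2pw(text: str) -> str:
    # Annotate the polyphonic character in "銀行" with a bopomofo reading.
    # The bracket format is hypothetical.
    return text.replace("銀行", "銀行[ㄏㄤˊ]")


def toy_tokenizer(mel: Mel) -> Tokens:
    # One fake unit per conditioning frame.
    return [int(sum(frame)) % 8 for frame in mel]


def toy_llm(text: str, u_cond: Tokens) -> Tokens:
    # One fake unit per input character; u_cond is ignored here.
    return [ord(ch) % 8 for ch in text]


def toy_cfm(v: List[float], u_cond: Tokens, u_output: Tokens, x_cond: Mel) -> Mel:
    # One fake two-bin Mel frame per generated unit.
    return [[float(t) + v[0]] * 2 for t in u_output]


pipe = TTSPipeline(toy_g2pw, toy_tokenizer, toy_llm, toy_cfm)
v = [0.5]                                  # placeholder for the conditioning vector v
x_cond = [[0.1, 0.2], [0.3, 0.4]]          # placeholder conditioning Mel frames
x_output = pipe.synthesize("銀行", v, x_cond)
print(len(x_output), "Mel frames")
```

Keeping the g2pW step behind an `augment` flag mirrors its optional role in the pipeline: the same downstream components run whether or not the text carries phoneme annotations.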
