TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li
TL;DR
TTS-Transducer tackles end-to-end TTS by marrying a neural transducer for monotonic text-to-code alignments with a residual-codebook head to predict multiple discrete speech codes per frame. The first codebook is generated via RNNT-based alignment, while remaining codes are produced non-autoregressively using aligned encoder outputs and prior codes, with speaker style conditioned through Global Style Tokens. Training optimizes a weighted loss $\lambda_{total} = (1 - \alpha)\lambda_{RNNT} + \alpha\lambda_{CE}$ with $\alpha = 0.4$, and decoding employs label-looping with nucleus sampling ($p = 0.95$) for the first codebook. Across multiple audio codecs (e.g., EnCodec, NeMo-Codec, DAC) and tokenizations (BPE, IPA), the model achieves competitive intelligibility and robust naturalness on challenging texts without large-scale pretraining, and is open-sourced in NeMo.
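The weighted loss and the nucleus-sampling step described above can be illustrated with a minimal, standard-library sketch. It is not the NeMo implementation. The function names (`total_loss`, `nucleus_filter`, `nucleus_sample`) are hypothetical, the losses are assumed to be scalars computed elsewhere, and the label-looping decoding loop itself is not shown. Only $\alpha = 0.4$ and $p = 0.95$ come from the text.

```python
import random

ALPHA = 0.4   # weight on the cross-entropy term, as stated in the text
TOP_P = 0.95  # nucleus threshold used for the first codebook


def total_loss(rnnt_loss, ce_loss, alpha=ALPHA):
    """Weighted sum (1 - alpha) * RNNT loss + alpha * CE loss.

    Both inputs are assumed to be plain scalars here; in a real
    training loop they would be framework tensors.
    """
    return (1.0 - alpha) * rnnt_loss + alpha * ce_loss


def nucleus_filter(probs, top_p=TOP_P):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, and renormalize over that set.

    Returns a dict mapping token index -> renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def nucleus_sample(probs, top_p=TOP_P, rng=random):
    """Draw one token index from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy distribution over four hypothetical first-codebook tokens.
    toy_probs = [0.5, 0.3, 0.17, 0.03]
    print(total_loss(rnnt_loss=2.0, ce_loss=1.0))
    print(nucleus_sample(toy_probs, rng=random.Random(0)))
```

In this toy distribution the lowest-probability token falls outside the nucleus and can never be sampled, which is the intended effect of truncating the tail before sampling.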
Abstract
This work introduces TTS-Transducer, a novel architecture for text-to-speech that leverages the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments, which removes the need for explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, making it possible to apply text modeling approaches to speech generation. However, audio codec models with residual quantizers require predicting multiple tokens per frame from several codebooks, which poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining codes using the alignment extracted from the transducer loss. The proposed system is trained end-to-end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
