BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis
Jingyuan Xing, Mingru Yang, Zhipeng Li, Xiaofen Xing, Xiangmin Xu
TL;DR
This work tackles two core issues in autoregressive zero-shot TTS: the speed-quality trade-off caused by sequential token generation, and a supervision mismatch in which cross-entropy training treats all token errors equally. It introduces BridgeCode, a dual-representation paradigm that pairs sparse tokens with dense features and links them through bidirectional bridging modules. Built on it, BridgeTTS trains a GPT-2-based autoregressor to emit sparse tokens at a reduced frame rate and uses DenseBridge to reconstruct rich features for high-quality synthesis. Training jointly optimizes token-level and feature-level supervision to provide fine-grained guidance. Evaluations on LibriTTS show competitive naturalness and speaker similarity with the lowest token rate and faster synthesis, and ablations confirm the necessity of the bridging components and the feature loss. The approach generalizes to other AR-TTS models, offering a practical path toward scalable zero-shot TTS.
Abstract
Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed-quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.
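The supervision mismatch described above can be illustrated with a toy sketch. It is not the authors' implementation: the three-entry `codebook` (a stand-in for whatever maps sparse tokens to dense features), the L1 feature loss, and the `feat_weight` parameter are all assumptions made for illustration. The sketch shows two wrong predictions that cross-entropy penalizes identically, while a feature-level term separates the acoustically close error from the distant one.

```python
import math

def cross_entropy(logits, target):
    """Token-level loss: negative log-softmax probability of the target token."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def feature_l1(pred, ref):
    """Feature-level loss (assumed L1): mean absolute error between feature vectors."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def joint_loss(logits, target, pred_feat, ref_feat, feat_weight=1.0):
    """Sum of the token-level and weighted feature-level terms (weight is hypothetical)."""
    return cross_entropy(logits, target) + feat_weight * feature_l1(pred_feat, ref_feat)

# Hypothetical toy codebook: token 1 is acoustically close to token 0, token 2 is far.
codebook = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [-1.0, 1.0]}

target = 0
logits_near = [0.0, 2.0, 0.0]  # wrongly favors token 1 (near the target)
logits_far = [0.0, 0.0, 2.0]   # wrongly favors token 2 (far from the target)

def predicted_feature(logits):
    """Toy stand-in for dense reconstruction: look up the argmax token's feature."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return codebook[best]

ce_near = cross_entropy(logits_near, target)
ce_far = cross_entropy(logits_far, target)
joint_near = joint_loss(logits_near, target, predicted_feature(logits_near), codebook[target])
joint_far = joint_loss(logits_far, target, predicted_feature(logits_far), codebook[target])
```

Here `ce_near` and `ce_far` coincide because cross-entropy only looks at the probability assigned to the target token, whereas `joint_near` is smaller than `joint_far` because the feature term rewards the prediction that lands closer to the reference in feature space.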
