TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers

Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen

TL;DR

TacoLM addresses the slow inference and unstable text-audio alignment of zero-shot TTS by combining a two-stage autoregressive/non-autoregressive neural codec LM with MEGA gated attention and a gated cross-attention layer that improves alignment between text and audio. The model uses EnCodec as its neural audio codec and models discrete audio tokens, producing high-quality speech for unseen speakers with far fewer parameters and faster inference. On LibriSpeech, TacoLM achieves better word error rate (WER), speaker similarity, and mean opinion score (MOS) than VALL-E, with about 90% fewer parameters and roughly 5.2x faster inference; ablations confirm the contributions of gated prefix self-attention (GPSA) and gated cross-attention (GCA). These results suggest that TacoLM is a more efficient and accurate zero-shot TTS approach with practical deployment potential, and the authors release a demo and code to support further research.

Abstract

Neural codec language models (LMs) have demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, codec LMs often suffer from limited inference speed and stability, due to their auto-regressive nature and the implicit alignment between text and audio. To address these challenges, we introduce a new variant of the neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve training and inference efficiency and to reduce the model size. In addition, a gated cross-attention layer is included in each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In an evaluation on the LibriSpeech corpus, the proposed TacoLM achieves a better word error rate, speaker similarity, and mean opinion score than VALL-E, with 90% fewer parameters and a 5.2x speedup. Demo and code are available at https://ereboas.github.io/TacoLM/.

Paper Structure (15 sections, 2 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Framework overview of the proposed TacoLM. Gated prefix self-attention (GPSA) layers and gated cross-attention (GCA) layers are used in the AR model to generate the first layer of codec codes, while GCA layers are used in the NAR model to generate the remaining layers of codes.
  • Figure 2: Details of the gated cross-attention layer. $B$ in the yellow box denotes the relative position bias, for which RoPE is used as the position encoding. $d$ is the dimension of $Q$ and $K$.
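
The Figure 2 caption names the ingredients of the gated cross-attention layer: queries $Q$ and keys $K$ of dimension $d$, plus a position term $B$. The NumPy sketch below shows one way such a layer could be wired, with audio-side states attending to text-side states. It is an illustration under assumptions, not the authors' implementation: the sigmoid gate that mixes the attended context with the input, the parameter names (`W_q`, `W_k`, `W_v`, `W_g`), and the generic additive `bias` standing in for $B$ are all hypothetical. In particular, RoPE normally rotates queries and keys rather than adding a bias, and that step is omitted here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_cross_attention(audio, text, params, bias=None):
    """Hypothetical gated cross-attention (not the paper's exact layer).

    audio: (T_a, D) decoder-side states, used to form queries.
    text:  (T_t, D) text-side states, used to form keys and values.
    bias:  optional (T_a, T_t) additive term standing in for B.
    Returns the gated output (T_a, D) and the attention weights (T_a, T_t).
    """
    q = audio @ params["W_q"]            # (T_a, d)
    k = text @ params["W_k"]             # (T_t, d)
    v = text @ params["W_v"]             # (T_t, D)
    d = q.shape[-1]

    scores = q @ k.T / np.sqrt(d)        # scaled dot product
    if bias is not None:
        scores = scores + bias           # assumed additive position term
    attn = softmax(scores, axis=-1)      # each row sums to 1
    context = attn @ v                   # (T_a, D)

    # Assumed gate: an elementwise sigmoid decides how much attended
    # text context replaces the original audio-side state.
    gate = sigmoid(audio @ params["W_g"])
    return gate * context + (1.0 - gate) * audio, attn


# Usage with random inputs (all sizes are arbitrary).
rng = np.random.default_rng(0)
D, d, T_a, T_t = 8, 4, 5, 7
params = {
    "W_q": rng.normal(size=(D, d)),
    "W_k": rng.normal(size=(D, d)),
    "W_v": rng.normal(size=(D, D)),
    "W_g": rng.normal(size=(D, D)),
}
out, attn = gated_cross_attention(
    rng.normal(size=(T_a, D)), rng.normal(size=(T_t, D)), params
)
print(out.shape, attn.shape)
```

The gate is the part that distinguishes this from plain cross-attention: when it saturates near zero the layer passes the audio-side state through unchanged, and when it saturates near one the output is the attended text context.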