Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis

Zehai Tu; Guangyan Zhang; Yiting Lu; Adaeze Adigwe; Simon King; Yiwen Guo

TL;DR

This paper examines maximisation-based decoding and proposes Temporal Repetition Aware Diverse Beam Search (TRAD-BS), which searches for the most probable sequences of speech tokens and thereby yields speech with fewer mispronunciations and improved speaker consistency.

Abstract

Tokenising continuous speech into sequences of discrete tokens and modelling them with language models (LMs) has led to significant success in text-to-speech (TTS) synthesis. Although these models can generate speech with high quality and naturalness, their synthesised samples can still suffer from artefacts, mispronunciations, and repeated words. In this paper, we argue that these undesirable properties may be partly caused by the randomness of sampling-based strategies during the autoregressive decoding of LMs. Therefore, we look at maximisation-based decoding approaches and propose Temporal Repetition Aware Diverse Beam Search (TRAD-BS) to find the most probable sequences of the generated speech tokens. Experiments with two state-of-the-art LM-based TTS models demonstrate that our proposed maximisation-based decoding strategy generates speech with fewer mispronunciations and improved speaker consistency.

Paper Structure (17 sections, 5 equations, 2 figures, 3 tables)

Introduction
Background
LM-based TTS
Decoding strategies
Method
BS formulation
Pitfalls of BS in LM-based TTS
TRAD-BS
Experiments
Baselines
VoiceCraft
CosyVoice
Datasets
Evaluation
Setup
...and 2 more sections

Figures (2)

Figure 1: (a) A general inference paradigm of LM-based TTS. Speech tokens are generated autoregressively before being converted into speech. (b) TRAD-BS operates across both decoding steps and beams, adding temporal and beam-wise penalties to repeated tokens. This can also be applied to the case where multiple tokens need to be decoded at a single step.
Figure 2: Examples of the temporal collapse (upper) and the beam-wise diversity collapse (lower) of BS in LM-based TTS.
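
The idea in the Figure 1(b) caption, penalising tokens that repeat across decoding steps and across beams, can be illustrated with a minimal sketch. This is not the paper's formulation, which this page does not reproduce. The function name, the penalty forms (a fixed amount subtracted per repeat), the `window` parameter, and the toy model are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Beam = Tuple[List[int], float]  # (token sequence, cumulative model log-probability)


def penalised_beam_search(
    log_prob_fn: Callable[[Sequence[int]], List[float]],
    num_beams: int = 2,
    steps: int = 4,
    window: int = 1,
    temporal_penalty: float = 0.0,
    beam_penalty: float = 0.0,
) -> List[Beam]:
    """Beam search whose ranking is penalised for repeated tokens.

    Hypothetical illustration only: penalties affect which candidates are
    kept, while the stored score stays the unpenalised model log-probability.
    """
    beams: List[Beam] = [([], 0.0)]
    for _step in range(steps):
        candidates = []
        for tokens, score in beams:
            log_probs = log_prob_fn(tokens)
            # Tokens emitted in the last `window` steps of this beam.
            recent = tokens[-window:] if window > 0 else []
            for tok, lp in enumerate(log_probs):
                # Temporal penalty: discourage repeating a recent token.
                ranking = score + lp - temporal_penalty * recent.count(tok)
                candidates.append((ranking, score + lp, tokens, tok))

        chosen: List[Beam] = []
        picked: Dict[int, int] = {}  # how often each token was kept at this step
        for _slot in range(min(num_beams, len(candidates))):
            # Beam-wise penalty: discourage a token already kept by another beam.
            best = max(
                range(len(candidates)),
                key=lambda i: candidates[i][0]
                - beam_penalty * picked.get(candidates[i][3], 0),
            )
            _, raw_score, tokens, tok = candidates.pop(best)
            chosen.append((tokens + [tok], raw_score))
            picked[tok] = picked.get(tok, 0) + 1
        beams = chosen
    return sorted(beams, key=lambda b: b[1], reverse=True)


def toy_model(prefix: Sequence[int]) -> List[float]:
    """Assumed stand-in for an LM: always prefers token 0, ignoring the prefix."""
    return [math.log(0.6), math.log(0.3), math.log(0.1)]


plain = penalised_beam_search(toy_model)
no_repeats = penalised_beam_search(toy_model, temporal_penalty=5.0)
diverse = penalised_beam_search(toy_model, beam_penalty=5.0)
```

With the toy model, unpenalised beam search keeps emitting the single most probable token, loosely analogous to the temporal collapse named in Figure 2. A large `temporal_penalty` prevents consecutive repeats within a beam, and a large `beam_penalty` pushes the beams towards different tokens at each step.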
