Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Iván Vallés-Pérez, Biel Tura-Vecino, Piotr Biliński, Mateusz Lajszczak, Grzegorz Beringer, Roberto Barra-Chicote, Jaime Lorenzo-Trueba

TL;DR

This work tackles instability in LLM-based speech generation by introducing a self-supervised voice conversion (SSVC) model that learns speaker-disentangled representations from raw audio. An LLM is then trained to predict these discrete tokens, enabling zero-shot TTS in which content and style are inferred from text alone while the speaker identity is supplied by the SSVC decoder. The authors report that text-only prompting with speaker-disentangled codes improves stability, intelligibility (lower WER), and speaker similarity, outperforming state-of-the-art baselines including YourTTS, and exceeding the naturalness of human recordings from the LibriTTS test-other set. The approach reduces dependence on parallel data and reference embeddings, offering robust zero-shot capabilities at the cost of higher inference compute, with broad implications for accessibility and human–computer interfaces.

Abstract

Large Language Models (LLMs) are one of the most promising technologies for the next era of speech generation systems, due to their scalability and in-context learning capabilities. Nevertheless, they suffer from multiple stability issues at inference time, such as hallucinations, content skipping, or speech repetitions. In this work, we introduce a new self-supervised Voice Conversion (VC) architecture that learns to encode transitory features, such as content, separately from stationary ones, such as speaker ID or recording conditions, creating speaker-disentangled representations. Using speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the LLM to generate the content and the style of the speech from the text alone, similarly to humans, while the speaker identity is provided by the decoder of the VC model. Results show that LLMs trained on speaker-disentangled self-supervised representations improve speaker similarity by 4.7 pp over SOTA entangled representations and achieve a word error rate (WER) that is 5.4 pp lower. Furthermore, they achieve higher naturalness than human recordings of the LibriTTS test-other dataset. Finally, we show that using an explicit reference embedding negatively impacts intelligibility (stability), with WER increasing by 14 pp compared to the model that uses only text to infer the style.
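
The stability results above are stated as WER differences in percentage points (pp). As a reading aid, the following is a minimal sketch of how a word-level WER and a pp difference can be computed. It is not the authors' evaluation pipeline: the function names `word_error_rate` and `pp_difference` are hypothetical, the normalization (lowercasing and whitespace splitting) is an assumption, and how the transcripts of the generated speech are obtained is left out entirely.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; a real evaluation would likely normalize more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing (skipped content)
            insertion = curr[j - 1] + 1  # extra hypothesis word (e.g. a repetition)
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def pp_difference(wer_a: float, wer_b: float) -> float:
    """Absolute difference between two WERs, in percentage points."""
    return (wer_a - wer_b) * 100.0


if __name__ == "__main__":
    # Toy transcript with one repeated word and one skipped word,
    # the kinds of instability the abstract describes.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat sat on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Note that a pp difference is absolute: two systems with WERs of 10% and 5% differ by 5 pp, even though the relative reduction is 50%.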

Paper Structure (15 sections, 3 equations, 1 figure, 3 tables)