Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising

Ye-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai, Zhen-Hua Ling

TL;DR

This work tackles the sensitivity of LLM-based zero-shot TTS to noisy audio prompts by introducing a neural codec-based speech denoiser that operates directly on discrete acoustic tokens. The codec denoiser comprises a token denoiser and an embedding refiner, which predict the first two groups of clean tokens and refine their embeddings before waveform reconstruction, reducing noise artifacts without relying on heavy signal-domain processing. Integrated with LauraTTS to form NR-LauraTTS, the approach achieves superior speech quality and robustness in both enhancement and zero-shot TTS tasks, with lower computational overhead than traditional SE methods. The results demonstrate improved DNSMOS scores and speaker similarity, making noise-robust zero-shot TTS more practical for real-world use where prompt quality can be compromised.

Abstract

Large language model (LLM) based zero-shot text-to-speech (TTS) methods tend to preserve the acoustic environment of the audio prompt, leading to degradation in synthesized speech quality when the audio prompt contains noise. In this paper, we propose a novel neural codec-based speech denoiser and integrate it with the advanced LLM-based TTS model, LauraTTS, to achieve noise-robust zero-shot TTS. The proposed codec denoiser consists of an audio codec, a token denoiser, and an embedding refiner. The token denoiser predicts the first two groups of clean acoustic tokens from the noisy ones, which can serve as the acoustic prompt for LauraTTS to synthesize high-quality personalized speech or be converted to clean speech waveforms through the embedding refiner and codec decoder. Experimental results show that our proposed codec denoiser outperforms state-of-the-art speech enhancement (SE) methods, and the proposed noise-robust LauraTTS surpasses the approach using additional SE models.
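The data flow described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the class name `CodecDenoiserSketch`, its field names, and the toy stand-in functions are all hypothetical, and the real components are neural networks whose details are not given here. The sketch shows the two uses of the denoised tokens: as a clean acoustic prompt for the TTS model, or as input to the embedding refiner and codec decoder for waveform reconstruction.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# tokens[g][t] is the token id of group g at frame t (assumed layout).
Tokens = List[List[int]]
Embeddings = List[List[float]]


@dataclass
class CodecDenoiserSketch:
    """Hypothetical wiring of the three components named in the abstract."""
    encode: Callable[[Sequence[float]], Tokens]    # audio codec: waveform -> noisy tokens
    denoise_tokens: Callable[[Tokens], Tokens]     # token denoiser: noisy -> clean tokens
    refine: Callable[[Tokens], Embeddings]         # embedding refiner
    decode: Callable[[Embeddings], List[float]]    # codec decoder: embeddings -> waveform
    num_clean_groups: int = 2                      # "first two groups" per the abstract

    def clean_prompt_tokens(self, noisy_wave: Sequence[float]) -> Tokens:
        """Path 1: clean tokens that can serve as the acoustic prompt for TTS."""
        clean = self.denoise_tokens(self.encode(noisy_wave))
        if len(clean) != self.num_clean_groups:
            raise ValueError("token denoiser must return the first clean groups only")
        return clean

    def enhance(self, noisy_wave: Sequence[float]) -> List[float]:
        """Path 2: clean tokens -> refined embeddings -> clean waveform."""
        return self.decode(self.refine(self.clean_prompt_tokens(noisy_wave)))


# Toy stand-ins so the sketch runs; they carry no acoustic meaning.
def toy_encode(wave: Sequence[float]) -> Tokens:
    return [[int(abs(x) * 10) % 8 for x in wave] for _ in range(4)]


def toy_denoise(tokens: Tokens) -> Tokens:
    return [list(group) for group in tokens[:2]]


def toy_refine(tokens: Tokens) -> Embeddings:
    n_frames = len(tokens[0])
    return [[float(sum(group[t] for group in tokens))] for t in range(n_frames)]


def toy_decode(embeddings: Embeddings) -> List[float]:
    return [frame[0] / 10.0 for frame in embeddings]


sketch = CodecDenoiserSketch(toy_encode, toy_denoise, toy_refine, toy_decode)
prompt_tokens = sketch.clean_prompt_tokens([0.1, -0.5, 0.3])
enhanced_wave = sketch.enhance([0.1, -0.5, 0.3])
```

Keeping the token denoiser separate from the refiner and decoder mirrors the abstract's point that the TTS path needs only the clean tokens, while waveform reconstruction is an optional second step.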

Paper Structure

This paper contains 15 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Noise-robust zero-shot TTS synthesis process of the proposed NR-LauraTTS, where $\textcircled{S}$, $\textcircled{T}$, and $\textcircled{E}$ denote the "start of sequence", "turn of speech", and "end of sequence" tokens, respectively.
  • Figure 2: Overall structure of the proposed codec denoiser.