SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu

TL;DR

SSR-Speech tackles zero-shot text-based speech editing and text-to-speech (TTS) synthesis with a stable autoregressive neural codec model built on a Transformer decoder. It pairs inference-time classifier-free guidance with a watermark Encodec that embeds frame-level watermarks so that edited regions can be detected, while context-aware decoding preserves unedited regions and improves reconstruction in noisy conditions. The approach achieves state-of-the-art results on RealEdit and LibriTTS, is robust to multi-span edits and background sounds, and provides strong watermark detection accuracy; the source code is released for safety and research use. The work targets controllable speech editing and safe synthesis across languages, and lays a foundation for future extensions to advanced codecs, multi-task generation, and prosody editing.

Abstract

In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech so that the edited parts can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance in the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and also demonstrates remarkable robustness to background sounds. The source code and demos are released.
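To make the classifier-free guidance idea concrete, the sketch below blends conditional and unconditional next-token logits at inference time. It uses the commonly cited form `uncond + scale * (cond - uncond)`; the abstract does not give the exact formula or guidance scale used in SSR-Speech, so the function names, the toy logits, and the scale value are all illustrative assumptions.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Blend logits with the common classifier-free guidance form.

    Assumption: this is the widely used formulation, not necessarily
    the exact one in SSR-Speech. scale == 1.0 returns the conditional
    logits; scale > 1.0 pushes further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 3-token codebook (hypothetical values).
cond = [2.0, 1.0, 0.0]    # logits given the text/audio condition
uncond = [1.0, 1.0, 1.0]  # logits with the condition dropped
guided = cfg_logits(cond, uncond, scale=1.5)
probs = softmax(guided)
```

With a scale above 1, the token favored by the condition receives more probability mass than under the conditional logits alone, which is the general mechanism by which guidance can make autoregressive sampling less erratic.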

Paper Structure

This paper contains 14 sections, 3 equations, 2 figures, 2 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Diagram of the SSR-Speech model. We take single-span editing as an example, in which the tokens $\{a_{T_1}, \dots, a_{T_2}\}$ are masked and to be predicted. Here, $1 \leq T_1 < T_2 \leq T_3$, where $T_3$ is the length of the audio.
  • Figure 2: Diagram of the watermark Encodec model. During training, the parameters of the speech encoder and quantizer are kept frozen, while we update the speech decoder, masked encoder, and watermark predictor.
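The single-span setup in the Figure 1 caption can be sketched as a split of the audio token sequence into unedited context and a span to predict. This is a minimal illustration that assumes 1-based, inclusive indices as in the caption; the helper name `split_for_editing` and the integer stand-ins for audio tokens are hypothetical, and the sketch does not show how the model itself arranges or represents the masked span.

```python
def split_for_editing(tokens, t1, t2):
    """Split tokens a_1..a_{T3} around the span a_{T1}..a_{T2}.

    Indices are 1-based and inclusive, following the caption's
    constraint 1 <= T1 < T2 <= T3, where T3 = len(tokens).
    Returns (prefix, targets, suffix): prefix and suffix are the
    unedited context, targets is the span to be predicted.
    """
    t3 = len(tokens)
    if not (1 <= t1 < t2 <= t3):
        raise ValueError("expected 1 <= T1 < T2 <= T3")
    prefix = tokens[: t1 - 1]
    targets = tokens[t1 - 1 : t2]
    suffix = tokens[t2:]
    return prefix, targets, suffix


# Hypothetical 10-frame token sequence a_1..a_10.
tokens = list(range(1, 11))
prefix, targets, suffix = split_for_editing(tokens, 3, 5)
```

Here `prefix` and `suffix` together hold the unedited frames that stay available as context, and `targets` holds the frames the model would regenerate for the edit.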