AttentionStitch: How Attention Solves the Speech Editing Problem

Antonios Alexos, Pierre Baldi

TL;DR

This work proposes a novel approach to speech editing by leveraging a pre-trained text-to-speech (TTS) model and incorporating a double attention block network on top of it to automatically merge the synthesized mel-spectrogram with the mel-spectrogram of the edited text.

Abstract

Generating natural, high-quality speech from text is a challenging problem in natural language processing. Speech editing is a closely related and equally important task: it requires integrating edited speech into synthesized speech so seamlessly that the edit goes unnoticed. We propose a novel approach to speech editing that leverages a pre-trained text-to-speech (TTS) model, such as FastSpeech 2, and adds a double attention block network on top of it to automatically merge the synthesized mel-spectrogram with the mel-spectrogram of the edited text. We refer to this model as AttentionStitch, as it harnesses attention to stitch audio samples together. We evaluate AttentionStitch against state-of-the-art baselines on a single-speaker dataset (LJSpeech) and a multi-speaker dataset (VCTK). We demonstrate its superior performance through an objective evaluation and a subjective evaluation involving 15 human participants. AttentionStitch produces high-quality speech, even for words not seen during training, and operates automatically, without human intervention. Moreover, AttentionStitch is fast during both training and inference and generates human-sounding edited speech.
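To make the idea of stitching with attention concrete, the following is a minimal sketch of scaled dot-product cross-attention between two mel-spectrograms. It is not the authors' double attention block, whose layers and training details are not given in this summary. The function names (`cross_attention`, `stitch`), the boolean `edit_mask`, the fixed 0.5 blend, and the assumption that both spectrograms have the same number of frames are all hypothetical simplifications; a trained model would use learned projections instead.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query frame attends over all key frames.

    Every argument is a list of frames; each frame is a list of floats
    (for example, one value per mel bin).
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
        )
    return attended


def stitch(original_mel, synthesized_mel, edit_mask):
    """Hypothetical merge rule (an assumption, not the paper's method).

    Frames outside the edit are copied from the original recording. Frames
    inside the edit average the synthesized frame with its attention-weighted
    view of the original recording.
    """
    attended = cross_attention(synthesized_mel, original_mel, original_mel)
    merged = []
    for orig, syn, att, edited in zip(original_mel, synthesized_mel, attended, edit_mask):
        if edited:
            merged.append([0.5 * (s + a) for s, a in zip(syn, att)])
        else:
            merged.append(list(orig))
    return merged


# Toy example: 3 frames with 2 mel bins each; only the middle frame is edited.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthesized = [[0.0, 0.0], [2.0, 2.0], [0.0, 0.0]]
merged = stitch(original, synthesized, [False, True, False])
```

In this toy example the first and last frames of `merged` are copies of the original frames, and only the middle frame is changed. Because attention weights are non-negative and sum to one, the attended frame is always a convex combination of the original frames, which is what keeps the edited region close to the acoustic characteristics of the recording in this simplified setting.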

Paper Structure

This paper contains 9 sections, 3 figures.

Figures (3)

  • Figure 1: Overview of our proposed AttentionStitch model. AttentionStitch consists of a pre-trained FS2 model and a Double Attention Block.
  • Figure 2: MOS ($\uparrow$) scores for AttentionStitch, the compared methods, and the reference samples with 95% confidence intervals for LJSpeech. AttentionStitch outperforms the compared methods.
  • Figure 3: MOS ($\uparrow$) and MCD ($\downarrow$) scores for AttentionStitch, the compared methods, and the reference samples with 95% confidence intervals for VCTK. AttentionStitch outperforms the compared methods in both metrics.
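
The two metrics named in the captions can be sketched as follows. MOS (mean opinion score) is the average of listener ratings, and MCD (mel cepstral distortion) measures the distance between time-aligned mel-cepstral frames in dB, with lower values being better. The summary does not say how the authors computed either quantity, so the details here are assumptions: the function names are hypothetical, the MCD uses the common definition with frames that are already aligned and have the energy coefficient removed, and the 95% interval uses a normal approximation.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Assumes both inputs have the same number of frames and that the
    0th (energy) coefficient has already been dropped.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and frame-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_error)
    return total / len(ref_frames)


def mos_with_ci95(ratings):
    """Return (mean rating, half-width of a normal-approximation 95% CI)."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = sum(ratings) / n
    variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(variance / n)
    return mean, half_width


# Made-up ratings on a 1-5 scale, for illustration only.
mos, half_width = mos_with_ci95([4, 5, 3, 4, 4])
# Identical frames give zero distortion.
mcd_same = mel_cepstral_distortion([[0.1, 0.2]], [[0.1, 0.2]])
```

With few ratings, a Student's t interval would be wider than the normal approximation used here, so the half-width above should be read as a rough lower bound on the uncertainty.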