FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency

Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li

TL;DR

FluentEditor2 tackles text-based speech editing by enforcing multi-scale acoustic and prosody consistency during editing. It introduces two fluency-aware losses: Hierarchical Local Acoustic Smoothness Consistency ($L_{HLAC}$) and Contrastive Global Prosody Consistency ($L_{CGPC}$), built on a diffusion-based TSE backbone with word-level masking. Across VCTK and LibriTTS, FluentEditor2 achieves superior objective (MCD, STOI, PESQ) and subjective (FMOS, MOS, IMOS) metrics, with ablations confirming the contributions of each loss component. The approach yields more natural boundary transitions and coherent prosody, advancing practical, high-quality speech editing for real-world applications.

Abstract

Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording. Current TSE techniques often focus on minimizing discrepancies between the generated speech and the reference within the edited region during training to achieve fluent TSE performance. However, the generated speech in the edited region should also maintain acoustic and prosodic consistency with the unedited region and the original speech at both the local and global levels. To maintain speech fluency, we propose a new fluency-oriented speech editing scheme, termed FluentEditor2, which builds on our previous FluentEditor model and adds multi-scale acoustic and prosody consistency criteria to TSE training. Specifically, for local acoustic consistency, we propose a hierarchical local acoustic smoothness constraint that aligns the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose a contrastive global prosody consistency constraint that keeps the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that FluentEditor2 surpasses existing neural network-based TSE methods, including EditSpeech, CampNet, A$^3$T, FluentSpeech, and our FluentEditor, in both subjective and objective evaluations. Ablation studies further highlight the contribution of each module to the overall effectiveness of the system. Speech demos are available at: https://github.com/Ai-S2-Lab/FluentEditor2.
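To make the idea of a boundary smoothness constraint more concrete, the following is a minimal sketch, not the authors' implementation. It measures the mean squared gap between averaged mel frames on either side of each edit boundary and sums that gap over several window widths. The function names (`boundary_gap`, `hierarchical_gap`) and the fixed widths are hypothetical; they only stand in for the frame, phoneme, and word levels, which the paper derives from the actual segments rather than from fixed window sizes.

```python
from typing import List, Sequence

Mel = Sequence[Sequence[float]]  # T frames, each a list of mel-bin values


def _mean_frame(frames: Mel) -> List[float]:
    """Average a run of frames bin by bin."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def _mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def boundary_gap(mel: Mel, start: int, end: int, width: int) -> float:
    """Average gap across the two boundaries of the edited span [start, end).

    For each boundary, compare the mean of `width` frames before it with
    the mean of `width` frames after it. A boundary is skipped when the
    window would run off either end of the utterance. If `width` exceeds
    the span length, the windows overlap the far side of the span; this
    sketch does not guard against that.
    """
    total = len(mel)
    gaps = []
    for b in (start, end):
        if b - width >= 0 and b + width <= total:
            before = _mean_frame(mel[b - width:b])
            after = _mean_frame(mel[b:b + width])
            gaps.append(_mse(before, after))
    return sum(gaps) / len(gaps) if gaps else 0.0


def hierarchical_gap(mel: Mel, start: int, end: int,
                     widths: Sequence[int] = (1, 5, 20)) -> float:
    """Sum the boundary gap over several window widths.

    ASSUMPTION: the widths are illustrative stand-ins for frame-,
    phoneme-, and word-level units; they are not taken from the paper.
    """
    return sum(boundary_gap(mel, start, end, w) for w in widths)


if __name__ == "__main__":
    # Toy utterance: 10 frames, 4 mel bins, with an abrupt "edited" span.
    mel = [[0.0] * 4 for _ in range(10)]
    for t in range(4, 7):
        mel[t] = [1.0] * 4
    print(hierarchical_gap(mel, 4, 7, widths=(1, 2)))
```

In a training setup a term like this would be computed on predicted mel-spectrograms with a differentiable tensor library, so that abrupt transitions at the edit boundaries are penalized; the pure-Python version here is only meant to show the arithmetic.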

Paper Structure

This paper contains 29 sections, 9 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: The overall workflow of FluentEditor2. The total loss comprises the Reconstruction Loss, the Hierarchical Local Acoustic Smoothness Consistency Loss ($L_{HLAC}$), and the Contrastive Global Prosody Consistency Loss ($L_{CGPC}$).
  • Figure 2: Reconstruction performance comparison of mel-spectrograms generated by FluentEditor2 and baseline models. The example shows the sentence "he said he was sorry," with the red box indicating the masked and reconstructed segment "he said he was."
  • Figure 3: Visualization of editing effects on mel-spectrograms for speech insertion, replacement, and deletion. Dotted lines represent time step divisions aligned with each word in the sentence. Red highlights indicate the boundaries of the edited regions where the operations were applied.
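
As a reading aid for the Figure 1 caption, the three loss terms it names could plausibly be combined as a weighted sum. The reconstruction symbol $L_{rec}$ and the weights $\lambda_1$ and $\lambda_2$ are assumptions introduced here for illustration; the summary above does not state how the terms are weighted.

$$
L_{total} = L_{rec} + \lambda_1 \, L_{HLAC} + \lambda_2 \, L_{CGPC}
$$

Under this assumed form, setting $\lambda_1 = 0$ or $\lambda_2 = 0$ would correspond to removing one consistency term, which is the kind of comparison the ablations are described as making.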