Speech Editing -- a Summary

Tobias Kässmann, Yining Liu, Danni Liu

TL;DR

The paper tackles the problem of editing speech content via text transcripts while preserving naturalness and speaker identity. It surveys a range of approaches, from VoCo's voice morphing to diffusion-based and neural codec methods such as VoiceBox, SpeechX, FluentSpeech, and Mapache, and highlights transformer-, CNF-, and diffusion-inspired architectures for context-aware editing. The surveyed methods report substantial gains in boundary realism, prosody alignment, and robustness to noise, yet meaningful cross-paper comparisons remain difficult because tasks, datasets, and evaluation metrics vary. The work emphasizes the practical impact of improved speech editing for media production and accessibility and calls for standardized benchmarks to accelerate progress.

Abstract

With the rise of video production and social media, speech editing has become crucial for creators to address issues like mispronunciations, missing words, or stuttering in audio recordings. This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing. These approaches ensure edited audio is indistinguishable from the original by altering the mel-spectrogram. Recent advancements, such as context-aware prosody correction and advanced attention mechanisms, have improved speech editing quality. This paper reviews state-of-the-art methods, compares key metrics, and examines widely used datasets. The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.

Paper Structure (21 sections, 1 figure, 6 tables)

Figures (1)

  • Figure 1: Speech editing by inpainting (example illustration of a word being added that was not present before)