SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement

Kuan-Yu Chen, Jeng-Lin Li, Jian-Jiun Ding

TL;DR

SeamlessEdit tackles zero-shot speech editing under real-world noise by combining speech separation, which yields the separated speech $X_s$, with sparse Bayesian learning (SBL)-based frequency-band noise suppression to form a noise-aware representation. It then applies a neural codec editing model (VoiceCraft) with an in-context refinement mechanism that uses low-frequency embeddings to produce $X_{le}$, and reconstructs the final noisy edit as $Y = X_e + X_n$, where $X_e$ is the edited clean speech and $X_n$ is the noise component. On the EARS-WHAM noisy dataset, the framework achieves higher NMOS and PES, and favorable SMOS, relative to state-of-the-art baselines, and ablations show the value of in-context refinement. In practice, SeamlessEdit enables natural, boundary-preserving edits for applications such as podcast production, interview restoration, and archival enhancement, because it handles scenarios in which voice and ambient noise overlap.

Abstract

With the rapid development of zero-shot text-to-speech technologies, it is possible to generate high-quality speech signals that are indistinguishable from real ones. Speech editing, including speech insertion and replacement, appeals to researchers due to its potential applications. However, existing studies have considered only clean speech scenarios. In real-world applications, environmental noise can significantly degrade generation quality. In this study, we propose a noise-resilient speech editing framework, SeamlessEdit, for noisy speech editing. SeamlessEdit adopts a frequency-band-aware noise suppression module and an in-context refinement strategy. It addresses the scenario where the frequency bands of voice and background noise overlap. The proposed SeamlessEdit framework outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.

Paper Structure

This paper contains 12 sections, 9 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The proposed SeamlessEdit framework separates human voice $X_s$ and suppresses residual noise to derive an edited speech $X_{le}$. An in-context refinement enhances the editing of $X_s$ using $X_{le}$ for indistinguishable noisy editing results. The SBL filter is also adopted to improve the robustness to noise.
  • Figure 2: Mel-spectrograms of each stage of the proposed model; (a) the clean signal; (e) the noisy ground truth. Different editing stages include (b) separated speech $X_s$, (c) noise-suppressed speech $X_{l}$, (d) SeamlessEdit processed clean speech $X_e$, and (f) the final noisy editing result $Y$ of the proposed SeamlessEdit model. Red text and boxes denote the edited regions.
  • Figure 3: Spectral centroid and bandwidth statistics for short replacement results. We present the result of each editing step in Figure 2 for clean and noisy conditions. An ideal editing result should be indistinguishable from the noisy ground truth.
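
The Figure 3 caption refers to spectral centroid and bandwidth statistics but does not say how they are computed. The sketch below assumes the common definitions: the centroid is the magnitude-weighted mean frequency of a frame, and the bandwidth is the magnitude-weighted standard deviation around that centroid. The function name `spectral_stats`, the frame length, and the hop size are illustrative choices, not taken from the paper.

```python
import numpy as np


def spectral_stats(signal, sample_rate, frame_len=1024, hop=256):
    """Return per-frame spectral centroid and bandwidth, both in Hz.

    Assumed definitions (not confirmed by the paper):
      centroid  = sum(f * w)
      bandwidth = sqrt(sum(w * (f - centroid)^2))
    where w are the frame's magnitudes normalised to sum to 1.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))

    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    n_frames = 1 + (len(signal) - frame_len) // hop

    centroids = np.empty(n_frames)
    bandwidths = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        weights = mag / (mag.sum() + 1e-12)  # guard against silent frames
        centroid = np.sum(freqs * weights)
        centroids[i] = centroid
        bandwidths[i] = np.sqrt(np.sum(weights * (freqs - centroid) ** 2))
    return centroids, bandwidths


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1000 * t)  # narrow-band test signal
    noise = np.random.default_rng(0).standard_normal(sr)  # wide-band test signal
    for name, x in [("tone", tone), ("noise", noise)]:
        c, b = spectral_stats(x, sr)
        print(f"{name}: centroid {c.mean():.1f} Hz, bandwidth {b.mean():.1f} Hz")
```

Under these assumptions, comparing the mean and spread of both statistics inside an edited region against the same region of the noisy ground truth gives a rough numerical check of how well the edit blends in; statistics that closely match suggest the edited segment has a similar spectral shape to its surroundings.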