SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement
Kuan-Yu Chen, Jeng-Lin Li, Jian-Jiun Ding
TL;DR
SeamlessEdit targets robust zero-shot speech editing under real-world noise. It combines $X_s$-level speech separation with sparse-Bayesian-based frequency-band noise suppression to form a noise-aware representation. A neural codec editing model (VoiceCraft) then performs the edit, and an in-context refinement mechanism that uses low-frequency embeddings produces $X_{le}$. Finally, the noisy edit is reconstructed as $Y = X_e + X_n$. On the EARS-WHAM noisy dataset, the framework achieves higher NMOS and PES and favorable SMOS relative to state-of-the-art baselines, and ablations show the value of in-context refinement. In practice, SeamlessEdit enables natural, boundary-preserving edits for applications such as podcast production, interview restoration, and archival enhancement, including cases where voice and ambient noise overlap.
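As a rough illustration of the final reconstruction step $Y = X_e + X_n$, the sketch below adds an edited speech waveform back onto a background-noise waveform. The function name `remix_edit_with_noise` is hypothetical, and the choice to tile or crop the noise so that its length matches the edited speech is an assumption made only for this example; the summary does not say how the two signals are aligned, and this is not the authors' implementation.

```python
import numpy as np


def remix_edit_with_noise(x_e, x_n):
    """Return Y = X_e + X_n for 1-D waveforms (illustrative sketch only).

    x_e: edited speech waveform (its length may differ from the original).
    x_n: separated background-noise waveform.
    """
    x_e = np.asarray(x_e, dtype=np.float64)
    x_n = np.asarray(x_n, dtype=np.float64)

    # With no noise available, the edit is returned unchanged.
    if x_n.size == 0:
        return x_e.copy()

    # Assumption: an edit can lengthen or shorten the utterance, so the
    # noise is repeated (tiled) and then cropped to the edited length.
    repeats = -(-x_e.size // x_n.size)  # ceiling division
    x_n_fit = np.tile(x_n, repeats)[: x_e.size]

    return x_e + x_n_fit


if __name__ == "__main__":
    edited = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    noise = np.array([0.1, 0.2])
    print(remix_edit_with_noise(edited, noise))
```

Tiling is the simplest way to cover a longer edit, but it can introduce audible periodicity; a real system would more likely synthesize or crossfade the noise for the inserted region.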
Abstract
With the rapid development of zero-shot text-to-speech technologies, it is now possible to generate high-quality speech that is indistinguishable from real recordings. Speech editing, including speech insertion and replacement, has attracted researchers because of its potential applications. However, existing studies have considered only clean-speech scenarios. In real-world applications, environmental noise can significantly degrade generation quality. In this study, we propose SeamlessEdit, a noise-resilient framework for editing noisy speech. SeamlessEdit adopts a frequency-band-aware noise suppression module and an in-context refinement strategy. It handles the scenario in which the frequency bands of voice and background noise are not separated. The proposed framework outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.
