VoiceBridge: Designing Latent Bridge Models for General Speech Restoration at Scale

Chi Zhang, Zehua Chen, Kaiwen Zheng, Jun Zhu

TL;DR

VoiceBridge addresses general speech restoration (GSR) at scale by unifying diverse low-quality-to-high-quality (LQ→HQ) tasks into a single latent-to-latent generative process. It introduces a latent Schrödinger-Bridge model backed by a transformer, an energy-preserving VAE (EP-VAE) to align waveform and latent spaces across energy levels, and a joint neural prior to homogenize diverse LQ priors. A perceptually aware fine-tuning stage aligns both latent sampling and VAE decoding with human perceptual quality, enhancing 48 kHz restoration and robustness to unseen degradations. Empirical results show VoiceBridge outperforming strong GSR baselines on in-domain and out-of-domain benchmarks, with efficient sampling and strong zero-shot performance, indicating practical applicability for real-world high-fidelity speech restoration.

Abstract

Bridge models have recently been explored for speech enhancement tasks such as denoising, dereverberation, and super-resolution, but these efforts are typically confined to a single task or small-scale datasets, with limited general speech restoration (GSR) capability at scale. In this work, we introduce VoiceBridge, a GSR system rooted in latent bridge models (LBMs), capable of reconstructing high-fidelity full-band (i.e., 48 kHz) speech from various distortions. By compressing speech waveforms into continuous latent representations, VoiceBridge models the diverse LQ-to-HQ (low-quality to high-quality) tasks in GSR with a single latent-to-latent generative process backed by a scalable transformer architecture. To better carry the advantages of bridge models over from the data domain to the latent space, we present an energy-preserving variational autoencoder, which enhances the alignment between the waveform and latent spaces over varying energy levels. Furthermore, to address the difficulty of HQ reconstruction from distinctly different LQ priors, we propose a joint neural prior, which uniformly alleviates the reconstruction burden of the LBM. Finally, considering the key requirement of GSR systems, human perceptual quality, we design a perceptually aware fine-tuning stage to mitigate the cascading mismatch in generation while improving perceptual alignment. Extensive validation across in-domain and out-of-domain tasks and datasets (e.g., refining recent zero-shot speech and podcast generation results) demonstrates the superior performance of VoiceBridge. Demo samples are available at: https://VoiceBridge-demo.github.io/.

Paper Structure

This paper contains 36 sections, 23 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Overview of VoiceBridge. The upper part shows the designed LBM-based GSR system. The lower part shows our approaches to building a structured latent space and converging a joint neural prior. On the left, EP-VAE requires alignment between the latent and data spaces at varying energy levels. On the right, a joint neural prior is encoded to reduce the distance between the LQ priors and the HQ target, facilitating LBM reconstruction.
  • Figure 2: The t-SNE visualization (van der Maaten and Hinton, 2008) of the prior latents before and after prior convergence on augmented VCTK. Note the difference in axis scales between the two figures.
  • Figure 3: STFT spectrograms of the same piece of speech restored by different models. (a) Low-Quality Signal. (b) VoiceFixer. (c) Resemble-Enhance. (d) VoiceBridge (Ours). (e) Ground Truth.
  • Figure 4: Ablation study with different GSR metrics. The horizontal axis shows training steps, and the vertical axis displays performance. The models differ in whether EP-VAE and joint neural prior are employed.
  • Figure 5: Wasserstein Distance Matrix of latents with different degradation types for the vanilla prior and the joint neural prior.
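
As a rough illustration of what a distance matrix like the one in Figure 5 measures, the sketch below builds a pairwise matrix between groups of latent vectors. Everything in it is an assumption made for illustration: the latents are synthetic Gaussian samples, the degradation names are placeholders, and the distance is the average of per-dimension empirical 1-D Wasserstein-1 distances, which is not necessarily the estimator used in the paper.

```python
import random


def wasserstein_1d(xs, ys):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # For equal-size samples, the optimal coupling matches sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mean_marginal_w1(a, b):
    """Average the 1-D distance over latent dimensions (assumed simplification)."""
    dims = len(a[0])
    total = 0.0
    for d in range(dims):
        total += wasserstein_1d([v[d] for v in a], [v[d] for v in b])
    return total / dims


def distance_matrix(groups):
    """Return group names and the symmetric matrix of pairwise distances."""
    names = list(groups)
    matrix = [[mean_marginal_w1(groups[p], groups[q]) for q in names] for p in names]
    return names, matrix


def toy_latents(rng, shift, n=256, dims=4):
    """Hypothetical stand-in for encoded latents of one degradation type."""
    return [[rng.gauss(shift, 1.0) for _ in range(dims)] for _ in range(n)]


rng = random.Random(0)
groups = {
    "noise": toy_latents(rng, 0.0),      # placeholder degradation names
    "reverb": toy_latents(rng, 1.5),
    "clipping": toy_latents(rng, 3.0),
}
names, matrix = distance_matrix(groups)
for name, row in zip(names, matrix):
    print(f"{name:>9}", " ".join(f"{v:6.3f}" for v in row))
```

Under this reading, a prior that brings different degradation types closer together would show smaller off-diagonal entries than the vanilla prior.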