High-Resolution Speech Restoration with Latent Diffusion Model

Tushar Dhyani; Florian Lux; Michele Mancusi; Giorgio Fabbro; Fritz Hohl; Ngoc Thang Vu

TL;DR

Hi-ResLDM tackles the challenge of restoring speech degraded by multiple distortions while preserving high-frequency detail at 48 kHz. It introduces a two-stage framework: a recovery stage first increases the SNR by removing additive distortions, and a latent-diffusion-driven restoration stage then operates in the latent space of AudioMAE, conditioned on the recovered signal. The diffusion objective is formally defined by $\min_\phi \mathbb{E}_{z_y, z_0, z_t, t}[\| z_y - \Psi_{\phi}(z_t, z_0, t) \|^2_2]$ with $z_t = \sqrt{\bar{\alpha}_t} z_y + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\eta}, \ \boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, enabling stable recovery of high-frequency content. Empirical results on 1250 hours of 48 kHz clean data show that Hi-ResLDM outperforms GAN- and CFM-based baselines on non-intrusive metrics (DNSMOS, NISQA) and intrusive measures (eSTOI, WER), and listeners prefer Hi-ResLDM in about 60.8% of cases; iterative refinement provides no clear gains. The work demonstrates a practical pathway to high-fidelity, multi-distortion speech restoration suitable for professional applications, albeit at the cost of higher inference time.
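The objective above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: $z_y$ is read here as the latent of the clean target and $z_0$ as the latent of the recovered signal used for conditioning, the linear beta schedule and its endpoints are assumed (the summary does not specify a schedule), and `psi` is a hypothetical stand-in for the network $\Psi_\phi$.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t); the linear schedule is an assumption."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(z_y, t, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z_y + sqrt(1 - alpha_bar_t) * eta."""
    eta = rng.standard_normal(z_y.shape)
    # Reshape the per-example alpha_bar_t so it broadcasts over latent dims.
    a = alpha_bar[t].reshape(-1, *([1] * (z_y.ndim - 1)))
    return np.sqrt(a) * z_y + np.sqrt(1.0 - a) * eta

def restoration_loss(psi, z_y, z_0, alpha_bar, rng):
    """Monte Carlo estimate of E[ || z_y - psi(z_t, z_0, t) ||_2^2 ]."""
    t = rng.integers(0, len(alpha_bar), size=z_y.shape[0])
    z_t = forward_noise(z_y, t, alpha_bar, rng)
    pred = psi(z_t, z_0, t)
    # Squared L2 norm per example, averaged over the batch.
    sq_norm = np.sum((z_y - pred) ** 2, axis=tuple(range(1, z_y.ndim)))
    return float(np.mean(sq_norm))

# Toy usage with random latents of hypothetical shape (batch, frames, channels).
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z_y = rng.standard_normal((4, 8, 16))             # stand-in clean latents
z_0 = z_y + 0.1 * rng.standard_normal(z_y.shape)  # stand-in recovered latents
toy_psi = lambda z_t, z_0, t: z_0                 # placeholder "denoiser"
loss = restoration_loss(toy_psi, z_y, z_0, alpha_bar, rng)
```

Note that the network is trained to predict the clean latent $z_y$ directly rather than the noise $\boldsymbol{\eta}$, which is why the loss compares `pred` with `z_y`.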

Abstract

Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48 kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it well suited to high-resolution speech restoration.

Paper Structure (11 sections, 1 equation, 4 figures, 3 tables)

Introduction
Method
Recovery stage
Restoration stage
Experimental Setup
Data
Training dataset
Evaluation dataset
Evaluation protocol
Results
Conclusion

Figures (4)

Figure 1: Mel-spectrograms of a restored speech signal. The green rectangles highlight the sections where the harmonic structure generated by Hi-ResLDM is markedly better than that of Voicefixer [liu_voicefixer_2022] and Resemble Enhance [resemble_enhance].
Figure 2: A high-level overview of the Hi-ResLDM model, illustrating the components of the proposed two-stage approach. The black arrow connects components used during both training and inference. The blue dashed line connects the training components only, and the black dashed line connects components only used during inference.
Figure 3: Trend in speech quality across different speech restoration models over five iterative refinement steps.
Figure 4: Plot showing the distribution of speaker recognition cosine similarity (SR-CS) of comparative models. The x-axis shows the name of the model, and the y-axis shows the cosine similarity value.
