Combined Generative and Predictive Modeling for Speech Super-resolution

Heming Wang, Eric W. Healy, DeLiang Wang

TL;DR

This paper tackles speech bandwidth extension by combining predictive learning with diffusion-based generative modeling to improve robustness in real-world conditions. The authors present a two-stage framework in which a predictive model (DPARN) produces a coarse high-frequency estimate that conditions a diffusion-based generator (ARCN), and a repainting technique applied at inference preserves low-frequency details. Jointly training these components yields superior SR performance on simulated data (VCTK) and robustness to mismatched recording conditions, as supported by evaluations on real recorded datasets (DAPS and VCTK). The work also provides freely accessible multi-rate SR recordings to help advance real-world speech super-resolution research. The approach improves SISNR and PESQ while remaining resilient under mismatched conditions, which suggests potential for practical deployment in hearing aids, ASR, and TTS pipelines.

Abstract

Speech super-resolution (SR) is the task of restoring high-resolution speech from low-resolution input. Existing models employ simulated data and constrained experimental settings, which limit generalization to real-world SR. Predictive models are known to perform well in fixed experimental settings, but can introduce artifacts in adverse conditions. Generative models, on the other hand, learn the distribution of the target data and have a better capacity to perform well on unseen conditions. In this study, we propose a novel two-stage approach that combines the strengths of predictive and generative models. Specifically, we employ a diffusion-based model that is conditioned on the output of a predictive model. Our experiments demonstrate that the model significantly outperforms single-stage counterparts and existing strong baselines on benchmark SR datasets. Furthermore, we introduce a repainting technique during the inference of the diffusion process, enabling the proposed model to regenerate high-frequency components even in mismatched conditions. An additional contribution is the collection of real SR recordings, made with the same microphone at different native sampling rates, and an evaluation on them. We make this dataset freely accessible to accelerate progress towards real-world speech super-resolution.
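
To make the repainting idea concrete, here is a minimal sketch of one way a frequency-domain repainting merge could look. It is not the authors' implementation: the function names (`forward_noise`, `repaint_merge`), the time-domain FFT formulation, and the hard cutoff are all assumptions made for illustration. The only standard piece is the DDPM forward-noising formula, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε. The idea shown is that, at a reverse-diffusion step, the band already present in the low-resolution input is taken from a noised copy of that input, while the band above the cutoff is taken from the model's sample.

```python
import numpy as np


def forward_noise(x0, alpha_bar_t, rng):
    """Standard DDPM forward noising: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def repaint_merge(generated, known, sample_rate, cutoff_hz):
    """Combine two equal-length signals in the frequency domain.

    Bins at or below cutoff_hz come from `known` (the observed low band);
    bins above it come from `generated` (the model's current sample).
    Hypothetical helper: the hard spectral mask is an assumption.
    """
    n = len(generated)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    keep_known = freqs <= cutoff_hz
    merged = np.where(keep_known, np.fft.rfft(known), np.fft.rfft(generated))
    return np.fft.irfft(merged, n=n)


def repaint_step(x_generated, x_low_res, alpha_bar_t, sample_rate, cutoff_hz, rng):
    """One illustrative repainting step inside a reverse-diffusion loop.

    x_generated: the model's sample at the current noise level.
    x_low_res:   the low-resolution input, already upsampled to the target rate.
    """
    known_noised = forward_noise(x_low_res, alpha_bar_t, rng)
    return repaint_merge(x_generated, known_noised, sample_rate, cutoff_hz)
```

In a full sampler, `repaint_step` would be called after each denoising update, with `alpha_bar_t` taken from the noise schedule for that step. The choice of cutoff (for example, the Nyquist frequency of the low-resolution input) is likewise an assumption here.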

Paper Structure (23 sections, 12 equations, 6 figures, 4 tables, 2 algorithms)

Figures (6)

  • Figure 1: Illustration of the diffusion processes in a denoising diffusion probabilistic model.
  • Figure 2: Two-stage model diagram that depicts the training procedure of the predictive learning stage and the generative learning stage.
  • Figure 3: Diagram of the proposed attentive residual convolutional network (ARCN), used as the diffusion module. "ResBlock" denotes an attentional residual block.
  • Figure 4: Diagrams showing the detailed design of an attentional residual block within the ARCN encoder.
  • Figure 5: The architecture of an attention layer.
  • ...and 1 more figure