DisSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration

Ziqi Liang; Zhijun Jia; Chang Liu; Minghui Yang; Zhihong Lu; Jian Wang

TL;DR

DisSR is a general speech restoration model based on Disentangling Speech Representation, with two properties: degradation-prior guidance, which extracts a speaker-invariant degradation representation to guide the diffusion-based speech restoration model; and domain adaptation, which enhances the model's adaptability and generalization on cross-domain data.

Abstract

Previous work on speech restoration (SR) primarily focuses on single-task speech restoration (SSR), which cannot address general speech restoration problems. Training a specific SSR model for each distortion is time-consuming and lacks generality. In addition, most studies ignore the problem of model generalization to unseen domains. To overcome these limitations, we propose DisSR, a general speech restoration model based on Disentangling Speech Representation, with two properties: 1) Degradation-prior guidance, which extracts a speaker-invariant degradation representation to guide the diffusion-based speech restoration model. 2) Domain adaptation, where we design cross-domain alignment training to enhance the model's adaptability and generalization on cross-domain data. Experimental results demonstrate that our method can produce high-quality restored speech under various distortion conditions. Audio samples can be found at https://itspsp.github.io/DisSR.

Paper Structure (12 sections, 6 equations, 4 figures, 3 tables)

Introduction
DisSR
Background and hypothesis
Speaker-invariant degradation disentanglement
Cross-domain speech restoration
Experiment
Experiment setup
Hypothesis validation
Restoration performance evaluation
Single restoration task evaluation
Ablation study
Conclusion

Figures (4)

Figure 1: Different Distortions in Speech Signals
Figure 2: Overall pipeline of DisSR. $c^{d_{i}}$ and $s_{i}^{d_{i}}$ are the content and speaker style extracted from the input $x_{s_{i}}^{d_{i}}$. Instance Normalization (IN) can eliminate the global speaker style from $x_{s_{i}}^{d_{i}}$. $d_{i}$ is disentangled as the degradation prior that guides restoration. $c^{d_{i}}$ and $\hat{c}$ are the content of $x_{s_{i}}^{d_{i}}$ and of the predicted speech $\hat{x}_{s_{i}}$, respectively. $s_{i}$ is the clean speaker style used for reconstruction training.
Figure 3: Classification loss of degradation classifier.
Figure 4: Speech restoration results with different methods
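
The Figure 2 caption notes that Instance Normalization (IN) can eliminate global speaker style. As a rough illustration of why, the sketch below applies affine-free instance normalization to toy feature maps in plain Python. It is not the DisSR implementation: the function name `instance_norm`, the list-based feature layout, and the toy "utterances" are assumptions made for this example, and global speaker style is approximated here as nothing more than a per-channel offset and gain.

```python
import math


def instance_norm(features, eps=1e-5):
    """Normalize each channel of a (channels x frames) feature map over time.

    `features` is a list of channels, each a list of per-frame floats.
    Hypothetical helper for illustration; no learnable affine parameters.
    """
    normalized = []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([(v - mean) * scale for v in channel])
    return normalized


# Two toy single-channel "utterances" that share a temporal pattern but
# differ in offset and gain (an assumed stand-in for global speaker style).
pattern = [0.0, 1.0, 0.0, -1.0]
utt_a = [[2.0 + 1.0 * p for p in pattern]]
utt_b = [[-3.0 + 4.0 * p for p in pattern]]

norm_a = instance_norm(utt_a)
norm_b = instance_norm(utt_b)

# Largest per-frame difference between the two normalized utterances.
max_gap = max(abs(a - b) for a, b in zip(norm_a[0], norm_b[0]))
print(f"max gap after IN: {max_gap:.2e}")
```

After normalization the two toy utterances become nearly identical, because the per-channel mean and variance that distinguished them are removed while the shared temporal pattern is kept. Real speaker style is richer than an offset and gain, so this only hints at the intuition behind the caption.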
