Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement

Szu-Wei Fu; Rong Chao; Xuesong Yang; Sung-Feng Huang; Ryandhimas E. Zezario; Rauf Nasretdinov; Ante Jukić; Yu Tsao; Yu-Chiang Frank Wang

TL;DR

This work shows that the conventional practice of using early-reflected speech as the dereverberation target can degrade perceptual quality and downstream ASR performance. It also proposes a simple two-stage framework that achieves minimal distortion at a given level of perceptual quality.

Abstract

Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion–perception tradeoff, and data curation remain unresolved. In this work, we systematically address these three overlooked problems. First, we revisit the conventional practice of using early-reflected speech as the dereverberation target and show that it can degrade perceptual quality and downstream ASR performance. We instead demonstrate that time-shifted anechoic clean speech provides a superior learning target. Second, guided by the distortion–perception tradeoff theory, we propose a simple two-stage framework that achieves minimal distortion under a given level of perceptual quality. Third, we analyze the trade-off between training data scale and quality for USE, revealing that training on large uncurated corpora imposes a performance ceiling, as models struggle to remove subtle artifacts. Our method achieves state-of-the-art performance on the URGENT 2025 non-blind test set and exhibits strong language-agnostic generalization, making it effective for improving TTS training data. Code and models will be released upon acceptance.
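
As a rough illustration of the first point, the sketch below builds a time-shifted anechoic target from a clean utterance and a room impulse response (RIR). It is not the authors' implementation: the function names are hypothetical, and estimating the direct-path delay $n_0$ as the index of the largest-magnitude RIR tap is an assumption made here for simplicity.

```python
import numpy as np


def direct_path_delay(rir):
    """Estimate n_0 as the index of the largest-magnitude RIR tap.

    Assumption: the direct path is the strongest tap. The paper may
    estimate the delay differently.
    """
    return int(np.argmax(np.abs(rir)))


def shifted_anechoic_target(clean, rir):
    """Delay the clean signal by n_0 samples so that it is time-aligned
    with the direct-path component of the reverberant input."""
    n0 = direct_path_delay(rir)
    target = np.zeros_like(clean)
    if n0 < len(clean):
        target[n0:] = clean[: len(clean) - n0]
    return target, n0


def reverberant_input(clean, rir):
    """Simulate the reverberant model input, truncated to the clean length."""
    return np.convolve(clean, rir)[: len(clean)]


# Toy example: direct path at sample 5, one weaker reflection at sample 40.
rng = np.random.default_rng(0)
clean = rng.standard_normal(100)
rir = np.zeros(64)
rir[5] = 1.0
rir[40] = 0.3

noisy = reverberant_input(clean, rir)
target, n0 = shifted_anechoic_target(clean, rir)
```

In this toy setup the target contains no reflections at all, unlike an early-reflection target, but it keeps the same onset delay as the reverberant input, so the model is not asked to undo the direct-path shift.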

Paper Structure (18 sections, 9 equations, 10 figures, 8 tables)

Introduction
Proposed Method
Shifted Anechoic Clean Speech as a Superior Learning Target
Bridging Fidelity and Quality: A Two-Stage Framework
Trade-off Between Training Data Scale and Quality
Experiments
Dataset
Model Architecture
Evaluation Metrics
Results on Training Data Filtering Based on Quality Estimation
Results of Applying Time-Shifted Anechoic Clean Speech as Learning Targets
Results of the Two-Stage Combination of Regression and Generative Models
Comparison with Other Open-Source USE Models
Evaluation on Unseen Languages
Application to TTS Training Data Cleaning
...and 3 more sections

Figures (10)

Figure 1: Motivated by the distortion–perception tradeoff theory, the proposed two-stage framework integrates a frozen regression model with a residual generative model.
Figure 2: Histogram of VQScore for URGENT 2025 Challenge Track 1 subsets. Dashed lines indicate median scores.
Figure 3: Learning curves of UTMOS scores on the validation set under (a) different VQScore filtering thresholds and (b) different learning targets.
Figure 4: An example of a room impulse response, highlighting the time shift $n_0$ introduced by the direct path.
Figure 5: Example illustrating that GANs can focus on correcting over-smoothed regions while leaving other parts unchanged. The noisy speech is bandwidth-limited in the green box, corresponding to a less informative region.
...and 5 more figures
