TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network

Xiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu

TL;DR

TS-URGENet tackles universal speech enhancement (SE) under diverse distortions and sampling-rate variations with a three-stage pipeline: a filling stage that addresses packet loss; a separation stage that suppresses noise, reverberation, and clipping; and a restoration stage that compensates for bandwidth limitation and codec artifacts. Each stage uses a TF-GridNet variant. The framework also introduces metric-aware fine-tuning (MAFT), joint stage–metric-aware fine-tuning (JFT), and bandwidth-aware inference (BWAI) to balance multiple objective metrics. Key technical components include GAN-based filling with MPD/MRD discriminators, a combined MAFT loss that integrates MCD, PESQ, DNSMOS, UTMOS, and WavLM-based representations, and CWS-TF-GridNet with subband processing for restoration, all operating in the time-frequency domain via STFT/iSTFT. TS-URGENet ranked 2nd in Track 1 of the Interspeech 2025 URGENT Challenge and shows improvements across several perceptual and objective metrics, which supports the value of staged processing and metric-aware optimization for universal SE. The approach is relevant to speech enhancement in real-world communication pipelines constrained by bandwidth and distortion.

Abstract

Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage. The filling stage mitigates packet loss by preliminarily filling lost regions under noise interference, ensuring signal continuity. The separation stage suppresses noise, reverberation, and clipping distortion to improve speech clarity. Finally, the restoration stage compensates for bandwidth limitation, codec artifacts, and residual packet loss distortion, refining the overall speech quality. Our proposed TS-URGENet achieved outstanding performance in the Interspeech 2025 URGENT Challenge, ranking 2nd in Track 1.
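The stage ordering described in the abstract (filling, then separation, then restoration) can be illustrated with a minimal sketch. This is not the authors' implementation: `run_three_stage` and the three `toy_*` functions are hypothetical names introduced here, and each toy stage is a trivial stand-in (sample holding, mean removal, peak normalization) for the corresponding TF-GridNet-based network. The sketch only shows how the output of each stage feeds the next.

```python
from typing import Callable, List

Signal = List[float]
Stage = Callable[[Signal], Signal]


def run_three_stage(signal: Signal, filling: Stage,
                    separation: Stage, restoration: Stage) -> Signal:
    """Apply the three stages in the order given in the abstract."""
    filled = filling(signal)        # stage 1: restore signal continuity
    separated = separation(filled)  # stage 2: suppress interference
    return restoration(separated)   # stage 3: refine overall quality


def toy_filling(signal: Signal) -> Signal:
    # Assumption: exact zeros mark lost samples; hold the last seen value.
    out, last = [], 0.0
    for x in signal:
        if x == 0.0:
            out.append(last)
        else:
            out.append(x)
            last = x
    return out


def toy_separation(signal: Signal) -> Signal:
    # Crude stand-in for interference removal: subtract the mean (DC offset).
    mean = sum(signal) / len(signal)
    return [x - mean for x in signal]


def toy_restoration(signal: Signal) -> Signal:
    # Crude stand-in for quality refinement: peak-normalize to [-1, 1].
    peak = max(abs(x) for x in signal) or 1.0
    return [x / peak for x in signal]


degraded = [1.0, 0.0, 0.0, 3.0]  # two "lost" samples in the middle
enhanced = run_three_stage(degraded, toy_filling,
                           toy_separation, toy_restoration)
print(enhanced)
```

Passing the stages as callables keeps the composition independent of what each stage does internally, so any of the toy functions could be swapped for a trained model wrapper with the same signature.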

Paper Structure

This paper contains 16 sections, 5 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of the TS-URGENet framework.