Query-Based Asymmetric Modeling with Decoupled Input-Output Rates for Speech Restoration

Ui-Hyeop Shin; Jaehyun Ko; Woocheol Jeong; Hyung-Min Park

TL;DR

TF-Restormer tackles speech restoration under decoupled input-output sampling rates with a query-based asymmetric time-frequency (TF) encoder–decoder. A heavy encoder analyzes the observed input bandwidth, while a lightweight decoder uses learnable extension queries and cross-attention to synthesize missing high-frequency content, so a single model handles arbitrary input-output rate pairs $(f_E,f_D)$ without external resampling. Training combines SSL-based perceptual objectives with a novel scaled log-spectral loss and adversarial losses from a shared multi-scale SFI-STFT discriminator, which promotes stable optimization across diverse degradations and rates. The approach yields balanced improvements in signal fidelity and perceptual quality, supports streaming inference, and provides a unified framework for restoration tasks including denoising, super-resolution, and bandwidth extension. Ablations support the architectural choices.

Abstract

Speech restoration in real-world conditions is challenging due to compounded distortions and mismatches between input and desired output rates. Most existing systems assume a fixed and shared input-output rate, relying on external resampling that incurs redundant computation and limits generality. We address this setting by formulating speech restoration under decoupled input-output rates, and propose TF-Restormer, a query-based asymmetric modeling framework. The encoder concentrates analysis on the observed input bandwidth using a time-frequency dual-path architecture, while a lightweight decoder reconstructs missing spectral content via frequency extension queries. This design enables a single model to operate consistently across arbitrary input-output rate pairs without redundant resampling. Experiments across diverse sampling rates, degradations, and operating modes show that TF-Restormer maintains stable restoration behavior and balanced perceptual quality, including in real-time streaming scenarios. Code and demos are available at https://tf-restormer.github.io/demo.
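To make the decoupled-rate setting concrete, the sketch below counts how many frequency bins the encoder would observe and how many the decoder would have to produce for a given rate pair. It is illustrative only and not the authors' implementation. It assumes an STFT whose window has a fixed duration in milliseconds, so the bin spacing in Hz is the same at every sampling rate. The 40 ms window, the `BinPlan` type, and the `plan_bins` helper are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinPlan:
    n_fft_in: int           # window length in samples at the input rate
    n_fft_out: int          # window length in samples at the output rate
    encoder_bins: int       # frequency bins observed at the input rate
    decoder_bins: int       # frequency bins required at the output rate
    extension_queries: int  # output bins with no observed input counterpart
    bin_spacing_hz: float   # frequency resolution shared by both rates


def plan_bins(f_e: int, f_d: int, win_ms: float = 40.0) -> BinPlan:
    """Hypothetical helper: bin counts for input rate f_e and output rate f_d (Hz).

    Assumes a fixed-duration window (win_ms), which is an illustrative choice.
    """
    n_in = round(f_e * win_ms / 1000)
    n_out = round(f_d * win_ms / 1000)
    enc = n_in // 2 + 1    # one-sided spectrum, DC through Nyquist
    dec = n_out // 2 + 1
    # Assumption: when f_d <= f_e no bins need to be synthesized.
    ext = max(dec - enc, 0)
    return BinPlan(n_in, n_out, enc, dec, ext, 1000.0 / win_ms)


if __name__ == "__main__":
    for f_e, f_d in [(8000, 48000), (16000, 48000), (16000, 16000)]:
        print((f_e, f_d), plan_bins(f_e, f_d))
```

Under these assumptions, an 8 kHz input restored to 48 kHz gives 161 observed bins and 961 output bins, so 800 bins above the input Nyquist frequency would have to be synthesized. A matched 16 kHz pair needs none.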
