Query-Based Asymmetric Modeling with Decoupled Input-Output Rates for Speech Restoration

Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong, Hyung-Min Park

TL;DR

TF-Restormer tackles speech restoration under decoupled input-output sampling rates by introducing a query-based asymmetric TF encoder–decoder. The heavy encoder analyzes the observed input bandwidth, while a lightweight decoder uses learnable extension queries and cross-attention to synthesize missing high-frequency content, enabling arbitrary input-output rate pairs $(f_E,f_D)$ without external resampling. Training combines SSL-based perceptual objectives with a novel scaled log-spectral loss and adversarial losses implemented via a shared multi-scale SFI-STFT discriminator, promoting stable optimization across diverse degradations and rates. The approach yields balanced improvements in signal fidelity and perceptual quality, supports streaming inference, and provides a unified framework for restoration tasks including denoising, super-resolution, and bandwidth extension, with ablations validating the architectural choices.

Abstract

Speech restoration in real-world conditions is challenging due to compounded distortions and mismatches between input and desired output rates. Most existing systems assume a fixed and shared input-output rate, relying on external resampling that incurs redundant computation and limits generality. We address this setting by formulating speech restoration under decoupled input-output rates, and propose TF-Restormer, a query-based asymmetric modeling framework. The encoder concentrates analysis on the observed input bandwidth using a time-frequency dual-path architecture, while a lightweight decoder reconstructs missing spectral content via frequency extension queries. This design enables a single model to operate consistently across arbitrary input-output rate pairs without redundant resampling. Experiments across diverse sampling rates, degradations, and operating modes show that TF-Restormer maintains stable restoration behavior and balanced perceptual quality, including in real-time streaming scenarios. Code and demos are available at https://tf-restormer.github.io/demo.

Paper Structure

This paper contains 54 sections, 4 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Overall architecture of TF-Restormer. The framework employs a query-based asymmetric design to handle decoupled input-output rates. The heavy TF-Encoder focuses on analysis within the native input bandwidth, while the lightweight TF-Decoder reconstructs missing high-frequency bands using learnable extension queries, bypassing the need for redundant resampling.
  • Figure 2: Unit modules in the TF-Encoder and TF-Decoder: (a) the time self module, based on MHSA with RoPE, and (b) the frequency module, based on MHSA with a frequency projection layer. The frequency cross-self module employs MHCA with keys/values from the encoder features, while the frequency self module is based on two ConvFFNs, similarly to the time self module.
  • Figure 3: Noisy-distorted speech input simulation pipeline. The simulation procedure is partitioned into physical distortion and digital distortion.
  • Figure 4: Gradient profiles of the proposed scaled log-spectral loss $\partial \ell / \partial d = w / (d + w)$ for different scale factors $w$. The curves show that the gradient is $1$ near zero error and monotonically decreases as the distance $d=|y-s|$ grows. Smaller $w$ values make the loss more sensitive to fine spectral deviations, while larger $w$ values maintain stronger gradients over broader error ranges.
  • Figure 5: Unit modules in the TF-Encoder and TF-Decoder. The (a) time module is based on MHSA with RoPE, while (b) the frequency encoder module is based on MHSA with a frequency projection layer. (c) The frequency decoder module utilizes MHCA with keys/values from the encoder features.
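
The gradient profile quoted in the Figure 4 caption, $\partial \ell / \partial d = w / (d + w)$, can be checked numerically with a small sketch. The closed form used here, $\ell(d) = w \log(1 + d/w)$, is an assumption: it is one antiderivative of the stated gradient with $\ell(0) = 0$, and the definition in the paper may differ (for example by a constant or a normalization). The function names are hypothetical and only serve the illustration.

```python
import math


def scaled_log_loss(d, w):
    """Assumed loss form w * log(1 + d / w), where d = |y - s| >= 0 and w > 0.

    This is an antiderivative of w / (d + w) chosen so that the loss is 0 at d = 0;
    it is an illustrative assumption, not necessarily the paper's exact definition.
    """
    return w * math.log1p(d / w)


def analytic_gradient(d, w):
    """Gradient stated in the Figure 4 caption: w / (d + w)."""
    return w / (d + w)


def numerical_gradient(d, w, eps=1e-6):
    """Central finite difference of the assumed loss, used as a consistency check."""
    return (scaled_log_loss(d + eps, w) - scaled_log_loss(d - eps, w)) / (2 * eps)


if __name__ == "__main__":
    distances = (0.0, 0.5, 2.0, 10.0)
    for w in (0.1, 1.0, 10.0):
        # Each row shows how quickly the gradient decays with d for a given w.
        row = [round(analytic_gradient(d, w), 3) for d in distances]
        print(f"w={w}: {row}")
```

Under this assumed form, the gradient equals 1 at zero error for every $w$, and at a fixed distance it is larger for larger $w$, which matches the caption's description that larger scale factors keep stronger gradients over a broader error range.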