DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks
Shahan Nercessian, Alexey Lukin, Johannes Imort
TL;DR
The paper addresses speech bandwidth extension (BWE) from 8 kHz to 48 kHz by modeling it as a dual-stage process: an exciter that broadens the spectrum of the input, followed by a linear time-varying (LTV) filter guided by outputs from an acoustic feature predictor. A differentiable STFT and a compressed spectral representation drive the LTV filter through a nonnegative spectral response, with an acoustic feature loss applied to the upper spectral bands to encourage a spectrally flat (white) excitation in the band to be synthesized. The authors adapt existing exciters into locally-conditioned variants (HiFi-GAN-2 and SEANet-2) and demonstrate improved fidelity and robustness over baselines on the VCTK dataset, including in subjective MUSHRA scores. They highlight the inductive bias of the DSP-informed architecture and discuss future directions, such as pink-noise-shaped excitations and integration with diffusion-based BWE models like NuWave2, with a view to practical on-device deployment.
Abstract
In this paper, we propose a dual-stage architecture for bandwidth extension (BWE) that increases the effective sampling rate of speech signals from 8 kHz to 48 kHz. Unlike existing end-to-end deep learning models, our proposed method explicitly models BWE using excitation and linear time-varying (LTV) filter stages. The excitation stage broadens the spectrum of the input, while the filtering stage shapes it based on outputs from an acoustic feature predictor. To this end, an acoustic feature loss term can implicitly encourage the excitation subnetwork to produce white spectra in the upper frequency band to be synthesized. Experimental results demonstrate that the added inductive bias provided by our approach can improve upon BWE results using the generators from either SEANet or HiFi-GAN as exciters, and that our means of adapting processing with acoustic feature predictions is more effective than that used in HiFi-GAN-2. Secondary contributions include extensions of the SEANet model to accommodate local conditioning information, as well as the application of HiFi-GAN-2 to the BWE problem.
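To make the filtering stage concrete, the following is a minimal NumPy sketch of one common way to realize an LTV filter: multiplying each STFT frame of an excitation signal by a nonnegative, frame-dependent magnitude response and resynthesizing by overlap-add. It is an illustration under stated assumptions, not the authors' implementation. The function names `frame_count` and `ltv_filter`, the Hann window, the frame size of 512, the hop of one quarter frame, and the zero-phase per-frame gains are all hypothetical choices made for this example. In the architecture described above, the excitation would come from the exciter subnetwork and the gains from the acoustic feature predictor; here both are plain arrays.

```python
import numpy as np


def frame_count(n_samples, n_fft=512, hop=128):
    """Number of STFT frames that ltv_filter uses for n_samples of input."""
    pad = n_fft - hop  # leading zeros so every input sample is fully overlapped
    return (pad + n_samples - 1) // hop + 1


def ltv_filter(excitation, gains, n_fft=512, hop=128):
    """Shape an excitation with frame-wise nonnegative spectral gains.

    excitation: 1-D array (stand-in for the output of an exciter).
    gains: array of shape (frame_count(len(excitation)), n_fft // 2 + 1),
           one nonnegative magnitude response per frame.
    """
    if n_fft != 4 * hop:
        raise ValueError("this sketch assumes hop == n_fft / 4")
    excitation = np.asarray(excitation, dtype=float)
    gains = np.asarray(gains, dtype=float)
    n_samples = len(excitation)
    n_frames = frame_count(n_samples, n_fft, hop)
    if gains.shape != (n_frames, n_fft // 2 + 1):
        raise ValueError("gains must have shape (n_frames, n_fft // 2 + 1)")
    if np.any(gains < 0):
        raise ValueError("gains must be nonnegative")

    # Periodic Hann window, used for both analysis and synthesis.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)

    pad = n_fft - hop
    padded = np.zeros((n_frames - 1) * hop + n_fft)
    padded[pad:pad + n_samples] = excitation

    out = np.zeros_like(padded)
    for t in range(n_frames):
        start = t * hop
        spectrum = np.fft.rfft(padded[start:start + n_fft] * window)
        # Time-varying filtering: a different real, nonnegative gain per frame.
        shaped = np.fft.irfft(spectrum * gains[t], n=n_fft)
        out[start:start + n_fft] += shaped * window

    # Squared periodic Hann windows at a quarter-frame hop sum to 1.5.
    return out[pad:pad + n_samples] / 1.5
```

With all gains set to one, the function returns its input unchanged, which is a convenient sanity check. Passing gains that vary from frame to frame changes the spectral envelope over time while leaving the fine structure of the excitation to the first stage, which is the division of labor the dual-stage design relies on.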
