UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang
TL;DR
UniverSR addresses high-fidelity audio super-resolution by replacing two-stage diffusion-vocoder pipelines with a vocoder-free flow-matching framework that models the conditional distribution of complex spectral coefficients. A vector field estimator trained with Conditional Flow Matching directly generates the high-band spectrum, and the waveform is reconstructed via the inverse Short-Time Fourier Transform (iSTFT), which enables end-to-end optimization and removes the dependence on a separate vocoder. The estimator is a ConvNeXt V2-based U-Net conditioned on spectral and temporal features, and classifier-free guidance balances perceptual richness against fidelity. Evaluations on speech, music, and environmental sounds show state-of-the-art performance for upsampling from 8–24 kHz to 48 kHz, with strong high-frequency reconstruction and perceptual quality and without vocoder artifacts. The approach is relevant to bandwidth expansion and restoration tasks in which the vocoder bottleneck previously limited audio quality and generalization.
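The Conditional Flow Matching training and classifier-free guidance mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The linear probability path, the `sigma_min` parameter, the guidance formula, the Euler sampler, and all function names are assumptions made for illustration, since this summary does not specify them. The complex spectrum is assumed to be stored as two real channels (real and imaginary parts).

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Draw (t, x_t, target) for a batch of data samples x1.

    Assumes a linear path from Gaussian noise x0 to data x1; the network
    would be trained to regress `target` from (x_t, t, conditioning).
    """
    x0 = rng.standard_normal(x1.shape)                     # noise endpoint
    # One time value per batch item, broadcast over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1      # point on the path
    target = x1 - (1.0 - sigma_min) * x0                   # velocity along the path
    return t, x_t, target

def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance (one common form, assumed here):
    w = 1 gives the conditional estimate, w > 1 pushes further toward it."""
    return v_uncond + w * (v_cond - v_uncond)

def euler_sample(velocity_fn, x0, n_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In a real system `velocity_fn` would call the trained estimator twice per step (with and without conditioning) and combine the two outputs with `guided_velocity`; the guidance scale `w` is the knob that trades perceptual richness against fidelity.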
Abstract
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
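To make the vocoder-free reconstruction step concrete, the following sketch shows how a generated high-band complex spectrum could be turned into a waveform with a plain iSTFT. It is a sketch under stated assumptions rather than the paper's pipeline: the FFT size, hop length, Hann window, the `merge_bands` helper, and the choice to keep the observed low-band bins unchanged are all hypothetical.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT with a periodic Hann window; returns (freq_bins, frames)."""
    win = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT by windowed overlap-add with squared-window normalization."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1) * win
    length = n_fft + hop * (frames.shape[0] - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += win ** 2
    # Edge samples with near-zero window energy are not recoverable.
    return out / np.maximum(norm, 1e-8)

def merge_bands(low_spec, high_spec, cutoff_bin):
    """Keep observed bins below cutoff_bin and generated bins at or above it."""
    merged = high_spec.copy()
    merged[:cutoff_bin] = low_spec[:cutoff_bin]
    return merged
```

Here `high_spec` stands in for the output of the generative model. Because the model works on complex coefficients, both magnitude and phase are available and no separate vocoder is needed to recover the waveform.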
