UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang

TL;DR

UniverSR performs high-fidelity audio super-resolution by replacing two-stage diffusion-plus-vocoder pipelines with a vocoder-free flow-matching framework that models the conditional distribution of complex spectral coefficients. A Vector Field Estimator trained with Conditional Flow Matching directly generates the high-band spectrum, and the waveform is reconstructed via iSTFT, which enables end-to-end optimization and removes the dependence on vocoder quality. The method employs a ConvNeXt V2-based U-Net conditioned on rich spectral and temporal features, together with classifier-free guidance to balance perceptual richness and fidelity. Evaluations across speech, music, and environmental sounds show state-of-the-art performance for upsampling from input sampling rates of 8–24 kHz to 48 kHz, with strong high-frequency reconstruction and perceptual quality and without vocoder artifacts. The approach is practically relevant for bandwidth expansion and restoration tasks where vocoder bottlenecks previously limited audio quality and generalization.
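To make the Conditional Flow Matching step concrete, the sketch below builds one training example and the regression loss a vector field estimator would minimize. It is an illustration under stated assumptions, not the authors' implementation: the straight-line path between Gaussian noise and the target, the uniform time sampling, and the names `cfm_training_pair` and `cfm_loss` are all hypothetical, and the complex spectrum is assumed to be stored as separate real and imaginary channels.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one Conditional Flow Matching training example.

    Assumption: a straight-line probability path
    x_t = (1 - t) * x0 + t * x1 with Gaussian noise x0. The exact path
    used in the paper is not stated in this summary.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample at t = 0
    t = rng.uniform(0.0, 1.0)            # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the path at time t
    target_velocity = x1 - x0            # time derivative of x_t on this path
    return x_t, t, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))
```

In training, a network would receive `x_t`, `t`, and the low-band conditioning, and its output would be passed to `cfm_loss` together with `target_velocity`.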

Abstract

In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
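The vocoder-free reconstruction described above can be sketched with a plain NumPy STFT/iSTFT pair. This is a minimal sketch, not the paper's code: the window, `n_fft`, and `hop` values, the helper names, and in particular the step of copying the observed low band and filling only the bins above `cutoff_bin` with generated coefficients are assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT with a periodic Hann window; returns (freq_bins, frames)."""
    window = np.hanning(n_fft + 1)[:-1]
    padded = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T

def istft(spec, length, n_fft=512, hop=128):
    """Inverse STFT by windowed overlap-add, normalized by the summed window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)
    # Drop the reflect padding added in stft().
    return out[n_fft // 2 : n_fft // 2 + length]

def merge_bands(low_spec, generated_high, cutoff_bin):
    """Keep bins below cutoff_bin and fill the rest with generated coefficients."""
    merged = low_spec.copy()
    merged[cutoff_bin:] = generated_high
    return merged
```

For example, with a 48 kHz target and `n_fft=512`, an input band-limited to 4 kHz would correspond to `cutoff_bin = int(4000 / (48000 / 512))`, that is, bin 42. Because the waveform comes straight from `istft`, no separately trained vocoder sits between the generated spectrum and the output audio.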

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overall framework of UniverSR showing (a) training stage and (b) inference stage. Specifically, the ODE solver includes a feature encoder and vector field estimator.
  • Figure 2: Detailed architecture of the (a) vector field estimator (VFE) and (b) feature encoder. Encoder, bottleneck, and decoder blocks of the VFE consist of a stack of ConvNeXt V2 blocks.
  • Figure 3: Subjective evaluation results (MOS) with 95% confidence intervals for 8 kHz to 48 kHz upsampling. Dashed lines indicate separation between classes.
  • Figure 4: Spectrograms of a harmonic instrumental sample. The bottom row displays magnified views of the regions enclosed by white rectangles in the top row. "Prop." denotes our proposed model with a classifier-free guidance scale $\omega$.
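The inference stage in Figure 1 and the guidance scale $\omega$ in Figure 4 can be illustrated with a fixed-step Euler solver that combines conditional and unconditional velocity estimates. This is a sketch under assumptions: the guidance rule `v_uncond + omega * (v_cond - v_uncond)` is one common classifier-free guidance convention and may differ from the paper's, the Euler solver and step count are arbitrary choices, and `toy_vfe` is a hypothetical stand-in for the trained vector field estimator.

```python
import numpy as np

def guided_velocity(vfe, x, t, cond, omega):
    """Blend conditional and unconditional velocities (assumed CFG convention)."""
    v_cond = vfe(x, t, cond)
    v_uncond = vfe(x, t, None)   # None stands for the dropped condition
    return v_uncond + omega * (v_cond - v_uncond)

def euler_sample(vfe, x0, cond, omega=1.0, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * guided_velocity(vfe, x, t, cond, omega)
    return x

def toy_vfe(x, t, cond):
    """Hypothetical estimator: straight-line velocity toward a fixed target."""
    target = np.zeros_like(x) if cond is None else cond
    return (target - x) / (1.0 - t)
```

With `omega=1.0` the update uses only the conditional estimate, `omega=0.0` ignores the condition, and values above 1 push samples further along the conditional direction, which is the trade-off between perceptual richness and fidelity mentioned in the TL;DR.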