FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

TL;DR

FLowHigh reframes audio super-resolution as conditional distribution learning and uses flow matching to enable fast, single-step sampling. It introduces a tailored probability path and a transformer-based vector-field estimator that model the high-resolution (HR) distribution conditioned on low-resolution (LR) mel-spectrograms, followed by vocoder synthesis and low-/high-frequency (LF/HF) post-processing. The approach achieves state-of-the-art objective metrics on VCTK with significantly lower latency than diffusion-based methods, and the analyses indicate that a probability path with a data-dependent prior helps capture high-frequency details. The work offers a practical, real-time-capable solution for high-fidelity audio bandwidth extension and demonstrates the potential of flow matching for audio generation tasks.
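To make the single-step idea concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes a straight-line probability path x_t = (1 - t) * x0 + t * x1 and a data-dependent prior formed by adding Gaussian noise to the LR mel-spectrogram, since the paper's actual path and its parameters are not given in this summary. The names `sample_prior`, `oracle_vector_field`, and `euler_sample` are hypothetical, and the oracle stands in for the trained transformer estimator.

```python
import numpy as np

def sample_prior(lr_mel, sigma, rng):
    """Assumed data-dependent prior: the LR mel-spectrogram plus Gaussian noise."""
    return lr_mel + sigma * rng.standard_normal(lr_mel.shape)

def oracle_vector_field(x_t, t, x0, hr_mel):
    """Stand-in for a learned estimator v_theta(x_t, t, condition).

    For the straight path x_t = (1 - t) * x0 + t * x1 the target velocity is
    the constant x1 - x0, so a perfect estimator would return exactly that.
    """
    return hr_mel - x0

def euler_sample(x0, vector_field, n_steps=1):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * vector_field(x, k * dt)
    return x

rng = np.random.default_rng(0)
hr_mel = rng.standard_normal((80, 32))    # toy HR mel-spectrogram (bins x frames)
lr_mel = hr_mel.copy()
lr_mel[40:, :] = -4.0                      # toy band-limited input: upper bins emptied
x0 = sample_prior(lr_mel, sigma=0.1, rng=rng)
field = lambda x, t: oracle_vector_field(x, t, x0, hr_mel)
one_step = euler_sample(x0, field, n_steps=1)
```

In this idealized setting the velocity is constant along the path, so a single Euler step from t = 0 lands on the target. A trained estimator only approximates that velocity, which is one reason the choice of path and prior can matter for single-step quality.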

Abstract

Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL, while maintaining computational efficiency with only a single-step sampling process.
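For readers unfamiliar with the first metric, the following is a minimal NumPy sketch of log-spectral distance (LSD) as it is commonly defined: the root-mean-square over frequency of the log10 power ratio between reference and estimate, averaged over frames. Lower is better. The frame size, hop size, Hann window, and `eps` value are assumptions chosen for illustration, not the paper's evaluation settings, and the function names are hypothetical.

```python
import numpy as np

def power_spectrogram(x, n_fft=2048, hop=512):
    """Hann-windowed power spectrogram with shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_spectral_distance(reference, estimate, n_fft=2048, hop=512, eps=1e-10):
    """RMS over frequency of the log10 power ratio, averaged over frames."""
    p_ref = power_spectrogram(reference, n_fft, hop)
    p_est = power_spectrogram(estimate, n_fft, hop)
    log_ratio = np.log10((p_ref + eps) / (p_est + eps))   # eps avoids log(0)
    return float(np.mean(np.sqrt(np.mean(log_ratio ** 2, axis=1))))
```

Both signals are expected to share the same sample rate and length. An estimate that is missing high-frequency energy produces large log ratios in the upper bins, which is why LSD is sensitive to bandwidth-extension quality.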

Paper Structure (16 sections, 7 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overview of FLowHigh. (a) The overall training and inference process of FLowHigh based on conditional flow matching. (b) The ordinary differential equation trajectory begins from a data-dependent prior distribution. (c) Post-processing using STFT and ISTFT to replace the lower frequency components.
  • Figure 2: Spectrogram visualizations of ground truth, input signal, outputs from baselines, and FLowHigh for a target sample rate of 48 kHz. UDM+ uses 50 NFEs. The input sample rate of an audio sample (p360_239) is 16 kHz.
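
The post-processing step in Figure 1(c) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: it assumes the replacement low band comes from the input signal already resampled to the target rate (the caption does not say where it comes from), that the cutoff is the Nyquist frequency of the original input, and that a Hann-windowed STFT with 50% overlap is used. The function name `replace_low_band` and its parameters are hypothetical; `scipy.signal.stft` and `scipy.signal.istft` are the only library calls.

```python
import numpy as np
from scipy.signal import stft, istft

def replace_low_band(generated, upsampled_input, sample_rate, cutoff_hz, n_fft=1024):
    """Keep the generated high band and copy bins below cutoff_hz from the input."""
    n = min(len(generated), len(upsampled_input))
    freqs, _, spec_gen = stft(generated[:n], fs=sample_rate, nperseg=n_fft)
    _, _, spec_in = stft(upsampled_input[:n], fs=sample_rate, nperseg=n_fft)
    low = freqs < cutoff_hz                        # boolean mask over frequency bins
    spec_out = np.where(low[:, None], spec_in, spec_gen)
    _, merged = istft(spec_out, fs=sample_rate, nperseg=n_fft)
    return merged[:n]                              # drop padding added by the STFT
```

For the example in Figure 2 (16 kHz input, 48 kHz target), this sketch would be called with `sample_rate=48000` and `cutoff_hz=8000`, so everything the input already contains is passed through unchanged and only the band above 8 kHz comes from the model.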