RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

Peng Liu, Dongyang Dai, Zhiyong Wu

TL;DR

RFWave tackles the latency bottleneck of diffusion-based audio waveform reconstruction with a multi-band Rectified Flow framework that operates at the STFT-frame level and processes all subbands in parallel. It combines a ConvNeXtV2 backbone, energy-aware losses, and a sampling scheme that places the Euler time points so that trajectory straightness increases by the same amount across intervals. With these, it reconstructs high-fidelity waveforms in as few as 10 sampling steps, at speeds up to 160 times faster than real-time on a GPU. The method performs well on both Mel-spectrogram and EnCodec-token inputs, surpassing diffusion baselines and rivaling GAN-based vocoders, and it generalizes better on out-of-domain tasks. In practice, RFWave is an efficient option for real-time audio generation and reconstruction across speech, music, and environmental sound.

Abstract

Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow approach designed to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete acoustic tokens. RFWave uniquely generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency. Leveraging Rectified Flow, which targets a straight transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 160 times faster than real-time on a GPU. An online demonstration is available at: https://rfwave-demo.github.io/rfwave/.
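
To make the sampling-step claim concrete, here is a minimal sketch of fixed-step Euler integration along a Rectified Flow trajectory, from noise at t = 0 to a sample at t = 1. It is an illustration under stated assumptions, not the authors' implementation: `euler_sample` and `toy_velocity` are hypothetical names, the closed-form velocity field stands in for RFWave's learned, conditioned network, and the time points are uniformly spaced rather than chosen by straightness.

```python
import numpy as np


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=np.float64).copy()
    # Uniform time grid (an assumption; the paper selects time points differently).
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # One Euler step: move along the current velocity for the interval length.
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x


def toy_velocity(target):
    """Hypothetical stand-in for a trained model: the ideal velocity field when
    the data distribution is a single point `target`. Its trajectories are
    perfectly straight lines from the noise sample to `target`."""
    def v(x, t):
        # Only evaluated for t < 1 inside euler_sample, so no division by zero.
        return (target - x) / (1.0 - t)
    return v


rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])       # toy "data" point
noise = rng.standard_normal(target.shape)  # starting point x_0 ~ N(0, I)
sample = euler_sample(toy_velocity(target), noise, num_steps=10)
```

Because the toy trajectories are exactly straight, Euler integration reaches `target` up to floating-point error regardless of the step count. A learned velocity field is only approximately straight, which is why the number and placement of steps still matter in practice.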

Paper Structure (42 sections, 9 equations, 11 figures, 13 tables)


Figures (11)

  • Figure 1: The overall structure of RFWave. $i_{sb}$ is the subband index, $C$ is the conditional input, which can be EnCodec tokens or a Mel-spectrogram, and $i_{bw}$ is the EnCodec bandwidth index. Modules enclosed in a dashed box, as well as dashed arrows, are optional.
  • Figure 2: An illustration of dividing complex spectrograms into subbands. The area highlighted in pink represents a subband, while the section enclosed by the two dashed vertical lines indicates the main section.
  • Figure A.1: The energy and weighting coefficients, represented by $\sigma$, display a consistent variation throughout the frames.
  • Figure A.2: Deviation (top) and straightness (bottom) over time. Red dots mark the Euler method time points, chosen so that straightness increases by a constant amount across intervals.
  • Figure A.3: Examples of spectrograms from Opencpop.
  • ...and 6 more figures