FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation

Jaekwon Im, Juhan Nam

TL;DR

FlashSR addresses the slow inference of diffusion-based audio super-resolution with a one-step diffusion model, distilled from a strong teacher (AudioSR), and a dedicated SR Vocoder for end-to-end waveform generation. Training combines a distillation loss, a distribution-matching distillation loss, and an adversarial loss, so that high-resolution audio is produced with a single neural function evaluation (NFE), making inference about 22x faster than the current state-of-the-art model. FlashSR delivers competitive objective metrics and superior subjective quality across speech, music, and sound effects, while eliminating the need for low-frequency post-processing. The SR Vocoder further improves waveform realism by conditioning on both the mel-spectrogram and the low-resolution input, enabling natural high-frequency continuity.

Abstract

Versatile audio super-resolution (SR) is the challenging task of restoring high-frequency components from low-resolution audio with sampling rates between 4 kHz and 32 kHz in various domains such as music, speech, and sound effects. Previous diffusion-based SR methods suffer from slow inference due to the need for a large number of sampling steps. In this paper, we introduce FlashSR, a single-step diffusion model for versatile audio super-resolution aimed at producing 48 kHz audio. FlashSR achieves fast inference by utilizing diffusion distillation with three objectives: distillation loss, adversarial loss, and distribution-matching distillation loss. We further enhance performance by proposing the SR Vocoder, which is specifically designed for SR models operating on mel-spectrograms. FlashSR demonstrates competitive performance with the current state-of-the-art model in both objective and subjective evaluations while being approximately 22 times faster.
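
The abstract names three training objectives but does not say how they are weighted against each other. As a rough illustration only, the sketch below assumes they are combined as a plain weighted sum; the names `LossWeights` and `total_student_loss` and the default weights of 1.0 are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights; the abstract does not give actual values.
    distill: float = 1.0   # distillation loss
    dmd: float = 1.0       # distribution-matching distillation loss
    adv: float = 1.0       # adversarial loss


def total_student_loss(distill_loss: float,
                       dmd_loss: float,
                       adv_loss: float,
                       w: LossWeights = LossWeights()) -> float:
    """Combine the three objectives as a weighted sum (an assumption)."""
    return w.distill * distill_loss + w.dmd * dmd_loss + w.adv * adv_loss


if __name__ == "__main__":
    # Placeholder loss values, chosen only to exercise the function.
    print(total_student_loss(1.0, 2.0, 3.0))
    print(total_student_loss(1.0, 2.0, 3.0, LossWeights(0.5, 1.0, 2.0)))
```

In a real training loop the three inputs would be scalar tensors computed from the student, teacher, and discriminator outputs; plain floats are used here to keep the sketch dependency-free.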

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of FlashSR.
  • Figure 2: Spectrogram of compared models.
  • Figure 3: Inverse real-time factor (RTF).
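
For readers unfamiliar with the metric in Figure 3, the real-time factor is conventionally the processing time divided by the duration of the audio processed, so its inverse is audio duration over processing time, and larger values mean faster inference. The sketch below uses that conventional definition; the function name and the example numbers are hypothetical and are not measurements from the paper.

```python
def inverse_rtf(audio_seconds: float, processing_seconds: float) -> float:
    """Return seconds of audio generated per second of compute.

    Uses the conventional definition RTF = processing time / audio duration,
    so inverse RTF = audio duration / processing time.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Hypothetical numbers: 10 s of audio produced in 0.5 s of compute.
    print(inverse_rtf(10.0, 0.5))
```

An inverse RTF above 1 means the model runs faster than real time.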