AudioSR: Versatile Audio Super-resolution at Scale
Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley
TL;DR
AudioSR is a diffusion-based framework for general-domain audio super-resolution across music, speech, and sound effects that handles varying input bandwidths and sampling rates. It estimates a high-resolution Mel spectrogram in a VAE latent space using a Transformer-UNet trained as a latent diffusion model with a cosine noise schedule, then reconstructs the waveform with a neural vocoder. Pre-processing aligns training and evaluation conditions, and post-processing preserves the low-frequency content of the input, enabling robust upsampling of inputs with 2–16 kHz bandwidth to 24 kHz bandwidth at a 48 kHz sampling rate. Objective and subjective evaluations show strong results, and AudioSR can act as a plug-and-play enhancer for the outputs of AudioLDM, MusicGen, and FastSpeech2. The work is relevant to a broad range of audio applications and points to real-time super-resolution and improved evaluation protocols as future directions.
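The low-frequency preservation step mentioned above can be illustrated with a minimal sketch. It assumes the model output and the low-resolution input are already time-aligned at 48 kHz and that the input cutoff frequency is known. The function name `replace_low_band` and the whole-signal FFT are illustrative choices, not the authors' implementation, which may operate on short-time spectra instead.

```python
import numpy as np

def replace_low_band(model_out, low_res_in, cutoff_hz, sample_rate=48_000):
    """Copy spectral content below cutoff_hz from low_res_in into model_out.

    Hypothetical helper: a whole-signal FFT is used for brevity.
    """
    if model_out.shape != low_res_in.shape:
        raise ValueError("signals must have the same length")
    n = model_out.shape[-1]
    out_spec = np.fft.rfft(model_out)
    in_spec = np.fft.rfft(low_res_in)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    low = freqs < cutoff_hz              # bins the input already covers
    out_spec[low] = in_spec[low]         # keep the trusted low band
    return np.fft.irfft(out_spec, n=n)   # generated high band is untouched

# Toy signals: the "model" gets the 1 kHz tone's amplitude wrong
# but adds a plausible 12 kHz component.
sr = 48_000
t = np.arange(sr) / sr
low_res = np.sin(2 * np.pi * 1_000 * t)
model_out = 0.5 * np.sin(2 * np.pi * 1_000 * t) + 0.3 * np.sin(2 * np.pi * 12_000 * t)
merged = replace_low_band(model_out, low_res, cutoff_hz=4_000)
```

In this toy case `merged` keeps the original 1 kHz tone at full amplitude together with the generated 12 kHz component, so the observed part of the spectrum is never altered by the generative model.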
Abstract
Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods are limited in the audio types they cover (e.g., music or speech only) and in the bandwidth settings they can handle (e.g., 4 kHz to 8 kHz). In this paper, we introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types, including sound effects, music, and speech. Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2 kHz to 16 kHz to a high-resolution audio signal at 24 kHz bandwidth with a sampling rate of 48 kHz. Extensive objective evaluation on various audio super-resolution benchmarks demonstrates the strong results achieved by the proposed model. In addition, our subjective evaluation shows that AudioSR can act as a plug-and-play module to enhance the generation quality of a wide range of audio generative models, including AudioLDM, FastSpeech2, and MusicGen. Our code and demo are available at https://audioldm.github.io/audiosr.
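To make the bandwidth figures concrete: by the Nyquist criterion a 48 kHz sampling rate can represent frequencies up to 24 kHz, and a "low-resolution" input is one whose energy stops at a lower cutoff. The sketch below simulates such an input with an ideal brick-wall low-pass filter. This filter and the helper names `simulate_low_resolution` and `band_energy` are assumptions made for illustration, not a description of the degradations used in the paper.

```python
import numpy as np

TARGET_SAMPLE_RATE = 48_000
TARGET_BANDWIDTH = TARGET_SAMPLE_RATE // 2  # Nyquist limit: 24 kHz

def simulate_low_resolution(audio, cutoff_hz, sample_rate=TARGET_SAMPLE_RATE):
    """Zero all spectral content above cutoff_hz (ideal low-pass, illustrative)."""
    n = audio.shape[-1]
    spec = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=n)

def band_energy(audio, lo_hz, hi_hz, sample_rate=TARGET_SAMPLE_RATE):
    """Sum of squared FFT magnitudes for lo_hz <= f < hi_hz."""
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(audio.shape[-1], d=1.0 / sample_rate)
    return float(power[(freqs >= lo_hz) & (freqs < hi_hz)].sum())

# One second of white noise stands in for full-band audio.
rng = np.random.default_rng(0)
full_band = rng.standard_normal(TARGET_SAMPLE_RATE)
degraded = simulate_low_resolution(full_band, cutoff_hz=4_000)
```

Here `degraded` has a 4 kHz bandwidth, inside the 2 kHz to 16 kHz input range stated above. A super-resolution model would be asked to fill the empty band between 4 kHz and `TARGET_BANDWIDTH`.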
