Audio Super-Resolution with Latent Bridge Models

Chang Li, Zehua Chen, Liyuan Wang, Jun Zhu

TL;DR

AudioLBM introduces a latent-to-latent bridge framework for audio super-resolution that directly leverages the informative LR waveform as its generation prior. By embedding waveforms into a continuous latent space via a VAE and connecting LR and HR latents with a Schrödinger-bridge-inspired process, the method achieves high-fidelity upsampling. Frequency-aware conditioning extends training to any-to-any bandwidth, and cascaded LBMs enable SR beyond 48 kHz, with prior augmentation to mitigate cascading errors. Across speech, sound effects, and music, AudioLBM delivers state-of-the-art objective and perceptual quality, including the first demonstrations of 96 kHz and 192 kHz SR, which indicates strong generalization and practical utility for audio production. The approach advances SR by aligning the prior with the generation target and by providing scalable, flexible upsampling, while also highlighting considerations for responsible deployment.

Abstract

Audio super-resolution (SR), i.e., upsampling a low-resolution (LR) waveform to its high-resolution (HR) version, has recently been explored with diffusion and bridge models, but previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequencies are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, making the first attempt to unlock audio upsampling beyond 48 kHz and enabling a seamless cascaded SR process that provides higher flexibility for audio post-production. Comprehensive experimental results on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, and set the first record for any-to-192 kHz audio SR. Demo at https://AudioLBM.github.io/.

Paper Structure

This paper contains 78 sections, 17 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: AudioLBM significantly improves the perceptual quality of text-to-speech wang2024maskgct, text-to-audio liu2024audioldm, and text-to-music li2024quality generation, and outperforms the state-of-the-art any-to-48 kHz SR system AudioSR liu2024audiosr.
  • Figure 2: The top part shows how the low-resolution waveform is simulated during training via low-pass filtering, the middle part depicts the baseline method AudioSR liu2024audiosr, which synthesizes high-resolution content from Gaussian noise, and the bottom part presents an overview of our proposed AudioLBM. AudioLBM learns a latent-to-latent generation process between the low- and high-resolution waveform latent representations, namely $\bm{z_x}^{\text{LR}} \in \mathbb{R}^{c_x \times \frac{L}{r_x}}$ and $\bm{z_x}^{\text{HR}} \in \mathbb{R}^{c_x \times \frac{L}{r_x}}$, where $L$ is the waveform length, and $c_x$ and $r_x$ are the channel dimension and compression ratio of the waveform latent. In contrast, AudioSR operates in the latent space $\bm{z_X}\in\mathbb{R}^{c_X \times \frac{F}{r_X} \times \frac{T}{r_X}}$ of the mel-spectrogram ${X} \in \mathbb{R}^{F \times T}$ with a noise-to-latent generation process, where $T$ and $F$ denote the numbers of time and frequency bins of the mel-spectrogram, and $c_X$ and $r_X$ denote the channel dimension and compression ratio of the mel-spectrogram latent.
  • Figure 3: LBMs can be naturally extended into higher-resolution waveform generation with a cascaded paradigm, where prior augmentation is utilized to avoid cascading artifacts and accumulating errors between stages.
  • Figure 4: Ablation results on 96Audio and 96Music in the 16$\rightarrow$96 kHz setting.
  • Figure 5: For case studies, we present the linear-amplitude STFT spectrograms of a 1.5-second speech segment from the VCTK-test set (sample p360_102) and a 5.12-second music clip.
  • ...and 9 more figures
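
The Figure 2 caption notes that the low-resolution waveform is simulated during training via low-pass filtering. The sketch below illustrates that idea only. The filter type (a Hamming-windowed sinc FIR), the tap count, and the function names `lowpass_kernel` and `simulate_lr` are assumptions made for illustration and are not taken from the paper, which may use different filters. The output keeps the original length and sample rate, so only the bandwidth is reduced.

```python
import math


def lowpass_kernel(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc FIR low-pass kernel with unit DC gain.

    Hypothetical helper: the paper's actual filter design is not specified here.
    """
    fc = cutoff_hz / sample_rate          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2              # center tap (num_taps should be odd)
    taps = []
    for n in range(num_taps):
        t = n - mid
        # Ideal low-pass impulse response (sinc), with its limit at t == 0.
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        # Hamming window to reduce truncation ripple.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]


def simulate_lr(waveform, cutoff_hz, sample_rate, num_taps=101):
    """Return a band-limited copy of `waveform` (same length, same sample rate)."""
    kernel = lowpass_kernel(cutoff_hz, sample_rate, num_taps)
    half = num_taps // 2
    length = len(waveform)
    out = []
    for i in range(length):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = i + half - k              # "same"-mode convolution, zero-padded edges
            if 0 <= j < length:
                acc += h * waveform[j]
        out.append(acc)
    return out


if __name__ == "__main__":
    sr = 48000
    # One second would be slow in pure Python, so use a short two-tone test signal.
    hr = [math.sin(2 * math.pi * 1000 * i / sr) + math.sin(2 * math.pi * 12000 * i / sr)
          for i in range(2000)]
    # A 4 kHz cutoff mimics content originally sampled at 8 kHz (Nyquist = 4 kHz).
    lr = simulate_lr(hr, cutoff_hz=4000, sample_rate=sr)
    print(len(lr))
```

With a 4 kHz cutoff, the 1 kHz tone lies in the passband and the 12 kHz tone in the stopband, so the filtered signal retains only the low-frequency content that an SR model would then be asked to extend.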