Fast and Flexible Audio Bandwidth Extension via Vocos

Yatharth Sharma

TL;DR

A Vocos-based bandwidth extension (BWE) model that enhances audio at 8-48 kHz by generating missing high-frequency content with a neural vocoder backbone and a lightweight Linkwitz-Riley-inspired refiner. It demonstrates practical, high-quality BWE at extreme throughput.

Abstract

We propose a Vocos-based bandwidth extension (BWE) model that enhances audio at 8-48 kHz by generating missing high-frequency content. Inputs are resampled to 48 kHz and processed by a neural vocoder backbone, enabling a single network to support arbitrary upsampling ratios. A lightweight Linkwitz-Riley-inspired refiner merges the original low band with the generated high frequencies via a smooth crossover. On validation, the model achieves competitive log-spectral distance (LSD) while running at a real-time factor of 0.0001 on an NVIDIA A100 GPU and 0.0053 on an 8-core CPU, demonstrating practical, high-quality BWE at extreme throughput.
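
To make the crossover idea concrete, the following is a minimal sketch of a Linkwitz-Riley-style merge applied to spectral bins. It is not the paper's refiner, whose details are not given here. It assumes the textbook fourth-order Linkwitz-Riley magnitude responses, in which the low-pass and high-pass weights sum to one at every frequency and both equal 0.5 at the crossover. The function names, the 4 kHz crossover, and the frequency-domain formulation are all illustrative assumptions.

```python
def lr_weights(freq_hz, crossover_hz, order=4):
    """Low/high magnitude weights of a Linkwitz-Riley-style crossover.

    Assumption: textbook LR magnitude, low = 1 / (1 + (f / fc) ** order).
    The two weights always sum to 1, so the merged band stays flat.
    """
    if freq_hz <= 0.0:
        return 1.0, 0.0
    ratio = (freq_hz / crossover_hz) ** order
    return 1.0 / (1.0 + ratio), ratio / (1.0 + ratio)


def merge_bands(original_bins, generated_bins, sample_rate, crossover_hz, order=4):
    """Blend two one-sided spectra (DC to Nyquist) bin by bin.

    Hypothetical helper: keeps the original signal below the crossover
    and the generated signal above it, with a smooth transition.
    """
    n_bins = len(original_bins)
    if n_bins < 2 or len(generated_bins) != n_bins:
        raise ValueError("spectra must have the same length (at least 2 bins)")
    nyquist = sample_rate / 2.0
    merged = []
    for k in range(n_bins):
        freq = k * nyquist / (n_bins - 1)  # centre frequency of bin k
        low, high = lr_weights(freq, crossover_hz, order)
        merged.append(low * original_bins[k] + high * generated_bins[k])
    return merged


# Example: an 8 kHz input has content up to 4 kHz, so cross over there.
original = [1.0, 1.0, 1.0, 1.0, 1.0]    # stand-in for the resampled input
generated = [0.0, 0.0, 0.0, 0.0, 0.0]   # stand-in for the vocoder output
merged = merge_bands(original, generated, sample_rate=48000, crossover_hz=4000.0)
```

With these stand-in spectra, the merged result keeps the original at DC and fades it out toward Nyquist, which is the behaviour a smooth crossover is meant to provide.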

Paper Structure

This paper contains 19 sections, 4 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Overview of the proposed BWE architecture following the Vocos framework. The input audio is upsampled to 48 kHz via sinc interpolation and transformed into an 80-bin mel-spectrogram. The backbone consists of 8 ConvNeXt blocks using $7 \times 1$ depthwise and $1 \times 1$ pointwise convolutions. The head employs a linear layer and ISTFT for waveform reconstruction, followed by a Linkwitz-Riley-inspired refiner that further improves output quality.
  • Figure 2: Model performance (LSD) across in-domain and out-of-domain (OOD) sample rates. The curve demonstrates a linear improvement in fidelity as input bandwidth increases, regardless of whether the rate was seen during training.
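
For readers unfamiliar with the LSD metric in Figure 2, here is a minimal sketch of one common definition: the per-frame root-mean-square difference between log power spectra, averaged over frames. The paper's exact variant (log base, STFT settings, frequency range) is not stated in this section, so the base-10 logarithm, the `eps` floor, and the function name are assumptions.

```python
import math


def log_spectral_distance(ref_frames, est_frames, eps=1e-10):
    """Mean over frames of the RMS log-power-spectrum difference.

    Each argument is a list of frames; each frame is a list of
    magnitude values, one per frequency bin. Lower is better.
    """
    if not ref_frames or len(ref_frames) != len(est_frames):
        raise ValueError("inputs must have the same, non-zero number of frames")
    total = 0.0
    for ref, est in zip(ref_frames, est_frames):
        if not ref or len(ref) != len(est):
            raise ValueError("frames must have the same, non-zero number of bins")
        squared_sum = 0.0
        for r, e in zip(ref, est):
            # Compare log power; eps avoids log(0) on silent bins.
            diff = math.log10(r * r + eps) - math.log10(e * e + eps)
            squared_sum += diff * diff
        total += math.sqrt(squared_sum / len(ref))
    return total / len(ref_frames)


# Example: an estimate 20 dB below the reference in every bin.
reference = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
estimate = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
lsd = log_spectral_distance(reference, estimate)
```

Identical spectra give a distance of zero, and a uniform tenfold magnitude gap gives a distance of about 2, since the log power spectra differ by 2 in every bin.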