Vector Quantized Diffusion Model Based Speech Bandwidth Extension

Yuan Fang; Jinglin Bai; Jiajie Wang; Xueliang Zhang

TL;DR

This paper introduces the first approach to speech bandwidth extension (BWE) that operates on the discrete features of a neural audio codec (NAC). It restores high-frequency details within highly compressed discrete tokens, which enhances speech intelligibility and naturalness.

Abstract

Recent advances in neural audio codecs (NACs) have unlocked new potential in audio signal processing. Studies have increasingly explored leveraging the latent features of NACs for various speech signal processing tasks. This paper introduces the first approach to speech bandwidth extension (BWE) that utilizes the discrete features obtained from a NAC. By restoring high-frequency details within highly compressed discrete tokens, this approach enhances speech intelligibility and naturalness. Based on Vector Quantized Diffusion, the proposed framework combines the strengths of an advanced NAC, diffusion models, and Mamba-2 to reconstruct high-frequency speech components. Extensive experiments demonstrate that this method achieves superior performance on both log-spectral distance (LSD) and ViSQOL, significantly improving speech quality.
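As a rough illustration of the log-spectral distance metric named in the abstract, the sketch below computes one commonly used form of LSD from two power spectrograms. The exact definition used in the paper is not given in this excerpt, so the formula (per-frame root-mean-square difference of log10 power, averaged over frames), the function name `log_spectral_distance`, and the `eps` floor are all assumptions made for illustration.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Assumed LSD form: mean over frames of the RMS log10-power difference.

    ref_power, est_power: lists of frames, each frame a list of
    non-negative power values (|STFT|^2) with matching shapes.
    eps: small floor (an assumption) to avoid log10(0).
    """
    if not ref_power or len(ref_power) != len(est_power):
        raise ValueError("spectrograms must be non-empty with equal frame counts")

    frame_distances = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        if not ref_frame or len(ref_frame) != len(est_frame):
            raise ValueError("frames must be non-empty with equal bin counts")
        # Squared difference of log10 power in each frequency bin.
        squared = [
            (math.log10(r + eps) - math.log10(e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # Root-mean-square over frequency bins for this frame.
        frame_distances.append(math.sqrt(sum(squared) / len(squared)))

    # Average over time frames.
    return sum(frame_distances) / len(frame_distances)


# Toy spectrograms (2 frames x 2 bins); the estimate has 10x the power.
reference = [[1.0, 2.0], [3.0, 4.0]]
estimate = [[10.0, 20.0], [30.0, 40.0]]
lsd_same = log_spectral_distance(reference, reference)
lsd_scaled = log_spectral_distance(reference, estimate)
```

Under this form, lower values mean the estimated spectrum is closer to the reference, and a uniform tenfold power mismatch gives a distance of about 1.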

Paper Structure (19 sections, 3 equations, 2 figures, 3 tables)

Introduction
Method
Descript Audio Codec (DAC)
VQDiffusion-BWE
Forward Process
Reverse Process
Network
Feature Extraction Module
Mamba2-DPM
Experiments
Data Configuration
Implementation Details
Evaluation Metrics
Log-Spectral Distance (LSD)
Virtual Speech Quality Objective Listener (ViSQOL)
...and 4 more sections

Figures (2)

Figure 1: Overview of the proposed model. (a) The structure of the Descript Audio Codec, (b) the VQ-Diffusion process, where the top part represents the forward process and the bottom part shows the step of estimating the previous $x_{t-1}$ through a neural network, and (c) the structure of the proposed ConMamba2.
Figure 2: Visualization of spectrograms of reference and upsampled speech (p360_033) for 16 kHz input. Red lines indicate the Nyquist frequencies of the downsampled signals.
