BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective

Andong Li, Tong Lei, Rilin Chen, Kai Li, Meng Yu, Xiaodong Li, Dong Yu, Chengshi Zheng

TL;DR

BridgeVoC reframes neural vocoding as an audio restoration problem by exploiting a range-space spectral (RSS) surrogate of the Mel-spectrum and connects source and target via a Schrödinger bridge to reduce the diffusion trajectory. It introduces a subband-aware diffusion network (BCD) with uneven subband divisions and a large-kernel convolutional attention module (LKCAM) to model time-frequency dependencies efficiently. A novel omnidirectional distillation loss enables effective single-step generation, supplemented by target-related and bijective consistency losses, yielding state-of-the-art results with as few as 4 inference steps and successful single-step distillation. Extensive experiments on LibriTTS, LibriTTS-out-of-distribution, and other benchmarks demonstrate strong reconstruction quality, robust generalization, and superior efficiency relative to GAN-, DDPM-, and flow-matching-based baselines. These results highlight the practicality of restoration-inspired diffusion for high-fidelity, fast neural vocoding in real-time systems.

Abstract

This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, through rank analysis, we compare the rank characteristics of the Mel-spectrum with those of other common acoustic degradation factors, and cast the vocoder task as a specialized case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and target spectrum as dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time-frequency (T-F) domain, we devise a novel subband-aware convolutional diffusion network as the data predictor, where subbands are divided following an uneven strategy, and a convolutional-style attention module with large kernels is employed for efficient T-F contextual modeling. To enable single-step inference, we propose an omnidirectional distillation loss to facilitate effective information transfer from the teacher model to the student model, and the performance is further improved by combining target-related and bijective consistency losses. Comprehensive experiments are conducted on various benchmarks and out-of-distribution datasets. Quantitative and qualitative results show that, while enjoying fewer parameters, lower computational cost, and competitive inference speed, the proposed BridgeVoC yields state-of-the-art performance over existing advanced GAN-, DDPM-, and flow-matching-based baselines with only 4 sampling steps. Consistent superiority is still achieved with single-step inference.
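To make the restoration view concrete, the sketch below illustrates one plausible reading of the range-space idea using NumPy. It is not the authors' implementation: the abstract does not define the RSS surrogate, so the code assumes it is obtained by applying the pseudo-inverse of the Mel operator to the Mel-spectrum. The helper `triangular_filterbank`, the sizes (257 bins, 80 bands, 50 frames), and the random magnitude spectrogram are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def triangular_filterbank(n_bands: int, n_bins: int) -> np.ndarray:
    """Hypothetical stand-in for a Mel filterbank.

    Builds overlapping triangular filters on a linear frequency axis.
    A real Mel filterbank spaces the filters on the Mel scale instead.
    """
    centers = np.linspace(0, n_bins - 1, n_bands + 2)
    bins = np.arange(n_bins)
    fb = np.zeros((n_bands, n_bins))
    for i in range(n_bands):
        left, mid, right = centers[i], centers[i + 1], centers[i + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


rng = np.random.default_rng(0)
n_bins, n_bands, n_frames = 257, 80, 50  # assumed sizes, not from the paper

A = triangular_filterbank(n_bands, n_bins)   # compression operator (bands x bins)
A_pinv = np.linalg.pinv(A)                   # Moore-Penrose pseudo-inverse

# Random non-negative matrix standing in for a target magnitude spectrogram.
target = np.abs(rng.standard_normal((n_bins, n_frames)))

mel = A @ target      # low-rank observation (the vocoder input)
rss = A_pinv @ mel    # assumed range-space surrogate, same shape as target

# The operator has rank at most n_bands, so information is lost.
rank_A = np.linalg.matrix_rank(A)

# What the surrogate misses lies in the null space of A.
residual = target - rss

print("rank of A:", rank_A, "of", n_bins, "frequency bins")
print("max |A @ residual|:", np.abs(A @ residual).max())
```

Under this assumption, the surrogate reproduces the Mel-spectrum exactly when compressed again, while the residual is invisible to the Mel operator. The missing component is what a restoration model would have to generate, which is consistent with treating the surrogate as the degraded input rather than starting from Gaussian noise.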

Paper Structure

This paper contains 31 sections, 41 equations, 16 figures, 13 tables, 1 algorithm.

Figures (16)

  • Figure 1: Illustrations of different vocoder paradigms. (a) Previous non-diffusion paradigms (e.g., autoregressive [van2016wavenet], flow-based [prenger2019waveglow], and GAN-based [kumar2019melgan] models), where the generator acts as a black box to directly model the distribution of target waveforms in the time domain or spectrograms in the T-F domain. (b) Previous diffusion paradigms (e.g., DDPM [ho2020denoising] and Flow Matching [tong2023conditional] models), where the target distribution is gradually modeled starting from Gaussian noise via a reverse process. (c) The proposed BridgeVoC, where the degradation prior of the Mel-spectrum is fully exploited, and the acoustic range-space representation serves as the source distribution.
  • Figure 2: PESQ scores versus the number of function evaluations (NFE) for GAN-based and diffusion-based approaches on the LibriTTS benchmark. $\left(\cdot\right)^{\star}$ indicates that the single-step sampling strategy is adopted for the diffusion model. A larger bubble/star denotes higher computational complexity.
  • Figure 3: Overall structure of the previously adopted NCSN++ model. $C$ denotes the original number of feature channels.
  • Figure 4: Framework of the proposed BridgeVoC, where BCD serves as the data predictor. (a) Overall forward and reverse process architecture. (b) Details of the range-space spectral surrogate. (c) Internal structure of the proposed convolutional-style subband-division module (CSBD). (d) Internal structure of the proposed large-kernel convolutional attention module (LKCAM), composed of stacked LKCABs. (e) Internal structure of the proposed convolutional-style band-merge module (CSBM). (f) Internal structure of the adopted LKCAB, which includes a convolutional attention block (CAB) and a convolutional feed-forward network (ConvFFN). A conditional time embedding $\mathbf{E}_{t}$ is introduced for feature modulation at each iterative step.
  • Figure 5: Diagram of the proposed single-step distillation scheme. (a) The single-step student model learns the deterministic mapping function from the teacher diffusion model, which involves two types of losses: distillation-related and ground-truth (GT)-related. (b) Detail of the naïve distillation loss. (c) Detail of the proposed omnidirectional distillation loss. (d) Detail of the bijective consistency loss, including two bijective mappings, namely source-to-source (S2S) and target-to-target (T2T).
  • ...and 11 more figures