Scalable Neural Vocoder from Range-Null Space Decomposition

Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Xiaodong Li, Dong Yu, Chengshi Zheng

TL;DR

This paper connects classical range-null decomposition (RND) theory to the vocoder task: reconstruction of the target spectrogram is formulated as the superposition of a range-space component and a null-space component. The resulting vocoder achieves state-of-the-art performance among existing advanced methods.

Abstract

Although deep neural networks have driven significant progress in neural vocoders in recent years, these models usually suffer from intrinsic challenges such as opaque modeling, inflexible retraining under different input configurations, and a parameter-performance trade-off. These hurdles can heavily impede the development of the field. To resolve these problems, we propose a novel neural vocoder in the time-frequency (T-F) domain. Specifically, we connect classical range-null decomposition (RND) theory to the vocoder task, formulating the reconstruction of the target spectrogram as the superposition of a range-space component and a null-space component. The former projects the representation in the original mel-domain into the target linear-scale domain, and the latter can be instantiated via neural networks to further infill the spectral details. To fully leverage the spectrum prior, an elaborate dual-path framework is devised, in which the spectrum is hierarchically encoded and decoded, and cross-band and narrow-band modules are used for effective modeling along the sub-band and time dimensions. To enable inference under various configurations, we propose a simple yet effective strategy that transforms multi-condition adaptation in the inference stage into data augmentation in the training stage. Comprehensive experiments are conducted on various benchmarks. Quantitative and qualitative results show that, while retaining a lightweight network structure and a scalable inference paradigm, the proposed framework achieves state-of-the-art performance among existing advanced methods. Code is available at https://github.com/Andong-Li-speech/RNDVoC.
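
The range-null formulation described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the matrix `A` is a random stand-in for a mel filterbank, `x_net` is a random placeholder for a network output, the toy sizes are arbitrary, and the helper name `range_null_combine` is hypothetical. The sketch only shows the linear-algebra property the decomposition relies on: the range-space term comes from the pseudo-inverse, and anything passed through the null-space projector leaves the mel-domain observation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (assumption, not the paper's settings).
n_freq, n_mel, n_frames = 16, 6, 4

# Hypothetical stand-in for a mel filterbank: a non-negative (n_mel x n_freq) matrix.
A = rng.random((n_mel, n_freq))
A_pinv = np.linalg.pinv(A)  # Moore-Penrose pseudo-inverse, shape (n_freq x n_mel)

# Linear degradation: the mel-domain observation y is a compressed view of x.
x = rng.random((n_freq, n_frames))  # "target" linear-scale spectrum
y = A @ x


def range_null_combine(y, x_net):
    """Superpose a range-space term and a null-space term.

    range part: A^+ y              (maps mel-domain input to the linear-scale domain)
    null part:  (I - A^+ A) x_net  (detail that is invisible to A)
    """
    range_part = A_pinv @ y
    null_part = x_net - A_pinv @ (A @ x_net)
    return range_part + null_part


# Placeholder for the output of a learned null-space module (assumption).
x_net = rng.random((n_freq, n_frames))
x_hat = range_null_combine(y, x_net)

# Re-applying A to the reconstruction recovers the observation up to rounding error.
consistency_error = np.max(np.abs(A @ x_hat - y))
print(f"max |A x_hat - y| = {consistency_error:.2e}")
```

Because the null-space term is annihilated by `A`, the reconstruction stays consistent with the mel input regardless of what the placeholder network produces; the network's role is limited to filling in detail that the mel representation cannot see.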

Paper Structure (50 sections, 59 equations, 30 figures, 21 tables, 1 algorithm)

This paper contains 50 sections, 59 equations, 30 figures, 21 tables, 1 algorithm.

Figures (30)

  • Figure 1: Illustrations of the previous and the proposed neural vocoder generation pipelines with range-null decomposition (RND) theory. (a) In previous T-F domain neural vocoders, the mapping from the mel-spectrogram to the target spectrogram/waveform was designed in a black-box manner. (b) We exploit the linear degradation prior to develop a more transparent generation pipeline, where the range-space module transforms the acoustic feature from the original mel-scale domain into the target linear-scale domain, and the null-space module generates fine-grained spectral details.
  • Figure 2: Comparison of model parameters and PESQ scores between the proposed RNDVoC and other mainstream vocoders on the LibriTTS benchmark. Larger bubbles indicate higher computational complexity. All generators are updated for 1M steps.
  • Figure 3: Illustrations of the proposed RNDVoC.
  • Figure 4: Framework diagram of the proposed RNDVoC, where the range-space module involves only the pseudo-inverse matrix operation and its output serves as the input of the null-space module. (a) Main framework diagram of the null-space module. (b) Detailed structure of the band-aware spectral encoding module (BAEM). (c) Detailed structure of the band-aware spectral magnitude/phase modules (BAMM/BAPM), which share a similar network structure. (d) Detailed structure of the dual-path block (DPB). Note that the non-shared scheme is adopted in (b)-(c).
  • Figure 5: Illustrations of the spectral encoding and decoding modules with the proposed parameter-shared strategy; the number of regions $I$ is set to 3 as an example. (a) Spectral encoding process. (b) Spectral decoding process; only one branch (magnitude or phase) is shown as an example.
  • ...and 25 more figures