Toward Complex-Valued Neural Networks for Waveform Generation

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

Abstract

Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via the inverse short-time Fourier transform (iSTFT), thereby avoiding learned upsampling stages that can increase computational cost. However, current approaches use real-valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex-valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback in complex-valued representations. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values and regularizes the training process. Finally, we propose a block-matrix computation scheme to improve training efficiency by reducing redundant operations. Experiments demonstrate that ComVo achieves higher synthesis quality than comparable real-valued baselines, and that its block-matrix scheme reduces training time by 25%. Audio samples and code are available at https://hs-oh-prml.github.io/ComVo/.
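
The abstract names two ideas that a small example may make more concrete: computing a complex product through a single real block matrix, and discretizing phase while keeping magnitude. The sketch below is a generic NumPy illustration of those two ideas, not ComVo's implementation. The function names `complex_matmul_blockwise` and `quantize_phase`, the `levels` parameter, and the choice of uniform rounding are assumptions made for illustration only.

```python
import numpy as np


def complex_matmul_blockwise(W, x):
    """Compute W @ x for complex W and x with one real matrix product.

    Uses the standard identity (A + iB)(u + iv) = (Au - Bv) + i(Bu + Av),
    written as the real block matrix [[A, -B], [B, A]] applied to [u; v].
    Illustrative only; the paper's block-matrix scheme may differ.
    """
    A, B = W.real, W.imag
    block = np.block([[A, -B], [B, A]])                  # shape (2m, 2n)
    stacked = np.concatenate([x.real, x.imag], axis=0)   # shape (2n, ...)
    out = block @ stacked
    m = A.shape[0]
    return out[:m] + 1j * out[m:]


def quantize_phase(z, levels=8):
    """Snap the phase of z to `levels` evenly spaced angles, keeping |z|.

    Hypothetical uniform quantizer; the paper's phase quantization
    may use a different rule.
    """
    step = 2 * np.pi / levels
    phase = np.round(np.angle(z) / step) * step
    return np.abs(z) * np.exp(1j * phase)


rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
x = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))

# The block-matrix result should match native complex multiplication.
y_block = complex_matmul_blockwise(W, x)
y_native = W @ x

# Quantization changes only the angle, not the magnitude.
x_q = quantize_phase(x, levels=8)
```

In this sketch the block form trades four separate real products for one larger one, which is one plausible reading of "reducing redundant operations"; the actual efficiency gain reported in the abstract refers to the authors' own scheme.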

Paper Structure

This paper contains 39 sections, 30 equations, 14 figures, 20 tables.

Figures (14)

  • Figure 1: Ground-truth distribution compared with samples generated by RVNN and CVNN.
  • Figure 2: Overview of the ComVo architecture.
  • Figure 3: Grad-CAM comparison across generator-discriminator configurations. Each row corresponds to a cMRD sub-discriminator operating at a different STFT resolution (i, ii, iii).
  • Figure 4: Visualizations over multiple training seeds. Each row corresponds to one run and contains five subplots: ground-truth samples, RVNN outputs, CVNN outputs, and the corresponding magnitude and phase distributions. This layout enables a run-to-run comparison of distributional behavior across the two models.
  • Figure 5: Average inference cost as a function of utterance duration.
  • ...and 9 more figures