PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee

TL;DR

PeriodWave introduces period-aware flow matching for high-fidelity waveform generation, explicitly disentangling periodic features via a multi-period estimator and a period-conditioned universal estimator that enables parallel inference. It integrates discrete wavelet transform-based high-frequency modeling and FreeU to reduce high-frequency noise, achieving superior Mel-spectrogram reconstruction and TTS performance while dramatically reducing training time compared to GAN-based vocoders. The approach demonstrates strong out-of-distribution robustness and competitive or superior results across single- and multi-speaker TTS and audio, with faster or comparable sampling efficiency. Overall, PeriodWave provides a scalable, period-aware framework for universal waveform generation with practical implications for end-to-end and two-stage speech systems, and the authors release code and checkpoints for reproducibility.

Abstract

Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown powerful generative performance in other domains; however, they have received little attention in waveform generation tasks because of their slow inference speed. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve the performance significantly, it also increases the computational cost. To mitigate this issue, we also propose a single period-conditional universal estimator that can run its feed-forward pass in parallel across periods through period-wise batch inference. Additionally, we utilize the discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce the high-frequency noise for waveform generation. The experimental results demonstrate that our model outperforms previous models in both Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at https://github.com/sh-lee-prml/PeriodWave.
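
As a rough illustration of what "capturing periodic features" can mean in practice, the sketch below folds a 1D waveform into a 2D grid for several periods, so that each column holds samples spaced exactly one period apart. This is not taken from the PeriodWave code: the function names `fold_by_period` and `fold_multi_period`, the zero padding, and the period set `(1, 2, 3, 5, 7)` are assumptions chosen for illustration (distinct primes are a common choice in multi-period models because their strided views overlap little).

```python
from typing import Dict, List, Sequence


def fold_by_period(wave: Sequence[float], period: int) -> List[List[float]]:
    """Fold a 1D signal into rows of length `period`.

    Column j of the result contains samples j, j + period, j + 2 * period, ...
    Zero padding at the end is an assumption made for this sketch.
    """
    if period < 1:
        raise ValueError("period must be a positive integer")
    pad = (-len(wave)) % period  # samples needed to reach a multiple of `period`
    padded = list(wave) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


def fold_multi_period(
    wave: Sequence[float],
    periods: Sequence[int] = (1, 2, 3, 5, 7),  # hypothetical period set
) -> Dict[int, List[List[float]]]:
    """Build one folded view per period; each view could feed its own estimator."""
    return {p: fold_by_period(wave, p) for p in periods}


if __name__ == "__main__":
    signal = [float(i) for i in range(10)]
    for p, view in fold_multi_period(signal, periods=(2, 3, 5)).items():
        print(p, len(view), "rows x", p, "cols")
```

Because every view is computed independently from the same input, the views can in principle be stacked along the batch dimension and processed together, which is the intuition behind period-wise batch inference with a single period-conditioned estimator.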

Paper Structure (60 sections, 12 equations, 4 figures, 16 tables)

Figures (4)

  • Figure 1: Waveform generation using conditional flow matching and ODE solver
  • Figure 2: Overall architecture of PeriodWave
  • Figure 3: Architecture of PeriodWave
  • Figure 4: Detailed information on listener restrictions and task completion interfaces