Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

Nikolaos Ellinas, Alexandra Vioni, Panos Kakoulidis, Georgios Vamvoukakis, Myrsini Christidou, Konstantinos Markopoulos, Junkwang Oh, Gunu Jho, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis

TL;DR

The paper addresses pitch controllability for mel-based neural vocoders by introducing a DSP-based pitch modification method that operates in the cepstral domain and needs no F0 estimate. It defines a mel pseudo-cepstrum by applying the mel-filterbank pseudo-inverse followed by a DCT, shifts the cepstral peak with a non-linear interpolation, and then reconstructs the modified mel-spectrogram. The approach is model-agnostic and lightweight. It is validated across multiple vocoders with objective pitch-tracking metrics and subjective MOS tests, where it performs competitively with traditional methods. The method enables training-free pitch control for TTS and voice conversion (VC) systems and suggests future work on disentangled representations of vocal tract and pitch.

Abstract

This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, the method is compatible with any mel-based vocoder without requiring additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space in order to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform, then converted to the cepstrum by applying the DCT. In this domain, the cepstral peak is shifted without having to estimate its position, and the modified mel-spectrogram is recomputed by applying the IDCT and the mel-filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders, as well as in comparison with traditional pitch modification methods.
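The pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is taken to be a log mel-spectrogram (as in the Figure 2 caption), the function names `dct_matrix`, `warp_quefrency` and `shift_log_mel` are hypothetical, and a uniform linear resampling of the quefrency axis stands in for the paper's non-linear interpolation, whose details are not given in this summary.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ y inverts it."""
    k = np.arange(n)[:, None]          # output (quefrency) index
    i = np.arange(n)[None, :]          # input (frequency-bin) index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    c[0] /= np.sqrt(2.0)               # scale the DC row so that C @ C.T = I
    return c


def warp_quefrency(cep, semitones):
    """Resample each frame of cep (n_bins, n_frames) along the quefrency axis.

    Raising F0 by a factor r moves the cepstral peak from quefrency q0 to
    q0 / r, so the output at index q reads the input at q * r. No peak
    picking or F0 estimate is involved. Uniform linear resampling is an
    assumption of this sketch, not the paper's interpolation scheme.
    """
    r = 2.0 ** (semitones / 12.0)
    q = np.arange(cep.shape[0], dtype=float)
    frames = [np.interp(q * r, q, cep[:, t], right=0.0)
              for t in range(cep.shape[1])]
    return np.stack(frames, axis=1)


def shift_log_mel(log_mel, fb, semitones):
    """Pitch-shift log_mel (n_mels, n_frames) given a filterbank fb (n_mels, n_bins)."""
    c = dct_matrix(fb.shape[1])
    to_cep = c @ np.linalg.pinv(fb)    # pseudo-inverse mel transform, then DCT
    to_mel = fb @ c.T                  # IDCT, then mel-filterbank
    return to_mel @ warp_quefrency(to_cep @ log_mel, semitones)
```

Here `fb` would be the filterbank matrix used to extract the features, for example the one returned by `librosa.filters.mel`, and `shift_log_mel(log_mel, fb, 5)` would return features shifted up by five semitones for a vocoder trained on that configuration. Because the uniform warp in this sketch also rescales the low-quefrency coefficients that carry the spectral envelope, it is likely to move formants along with the pitch; it is meant only to show the data flow.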

Paper Structure

This paper contains 9 sections, 10 equations, 4 figures.

Figures (4)

  • Figure 1: The proposed pitch modification method in the mel-spectrogram domain. The pseudo-inverse transform followed by the DCT, as well as the IDCT followed by the mel-filterbank, can each be combined into a single linear transformation.
  • Figure 2: Log mel-spectrogram (top) and the corresponding pseudo-cepstrum (bottom) for a voiced speech frame (left) as well as an unvoiced speech frame (right). The pitch shifted versions for $+5$ and $-5$ semitones are also shown.
  • Figure 3: Objective evaluation metrics. The horizontal axis represents the semitone shift value, whereas the vertical axis shows the value of the corresponding metric. The shaded regions indicate the $95\%$ confidence intervals. Lower values indicate better performance.
  • Figure 4: Subjective listening test results. The horizontal and vertical axes represent the semitone shift value and the MOS score, respectively. The shaded regions indicate the $95\%$ confidence intervals.
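
The semitone shifts quoted in the captions can be related to the position of the cepstral peak through the standard equal-tempered ratio. The symbols $s$ (shift in semitones), $r$ (frequency ratio), $F_0$ and $q_0$ (source pitch and peak quefrency) are introduced here for illustration and are not the paper's notation. The relation also assumes that the pseudo-cepstrum peak scales with $1/F_0$ in the same way as the peak of an ordinary cepstrum.

$$
r = 2^{s/12}, \qquad F_0' = r\,F_0, \qquad q_0' = \frac{1}{F_0'} = \frac{q_0}{r}
$$

For $s = +5$ this gives $r \approx 1.335$, so the peak moves down to about $0.749\,q_0$. For $s = -5$ it gives $r \approx 0.749$, and the peak moves up to about $1.335\,q_0$.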