Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders
Nikolaos Ellinas, Alexandra Vioni, Panos Kakoulidis, Georgios Vamvoukakis, Myrsini Christidou, Konstantinos Markopoulos, Junkwang Oh, Gunu Jho, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis
TL;DR
The paper addresses pitch controllability for mel-based neural vocoders by introducing a DSP-based pitch modification method that operates in the cepstral domain and requires no F0 estimation. It defines a mel pseudo-cepstrum by applying the pseudo-inverse of the mel filterbank followed by a DCT, then shifts the cepstral peak with a non-linear interpolation before reconstructing the modified mel-spectrogram. The approach is model-agnostic and lightweight. It is validated across multiple vocoders with objective pitch-tracking metrics and subjective MOS tests, where it performs competitively with traditional pitch modification methods. The method enables training-free pitch control for TTS and VC systems and suggests future work on disentangled representations of vocal tract and pitch.
Abstract
This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, the method is compatible with any mel-based vocoder without requiring additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space in order to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform and then converted to the cepstrum by applying a DCT. In this domain, the cepstral peak is shifted without having to estimate its position, and the modified mel-spectrogram is recomputed by applying the IDCT and the mel filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders, as well as in comparison with traditional pitch modification methods.
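The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the triangular filterbank in `mel_filterbank`, the log taken before the DCT, the `cutoff` parameter that leaves low quefrencies (the spectral envelope) untouched, and the plain linear resampling in `warp_quefrency` are all hypothetical choices. In particular, linear resampling stands in for the paper's non-linear interpolation, whose exact form is not given in this section.

```python
import numpy as np
from scipy.fft import dct, idct


def mel_filterbank(n_mels, n_fft, sr):
    """Illustrative triangular mel filterbank, shape (n_mels, n_fft // 2 + 1)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def warp_quefrency(cep, ratio, cutoff):
    """Resample the quefrency axis so a peak at q0 moves to about q0 / ratio.

    Assumption: coefficients below `cutoff` are kept as-is, and linear
    interpolation replaces the paper's non-linear interpolation.
    cep has shape (n_quefrency, n_frames).
    """
    n = cep.shape[0]
    q = np.arange(n, dtype=float)
    # Output index q reads from input position q * ratio (above the cutoff).
    src = np.where(q < cutoff, q, q * ratio)
    frames = [np.interp(src, q, cep[:, t], right=0.0) for t in range(cep.shape[1])]
    return np.stack(frames, axis=1)


def shift_pitch(mel, fb, ratio, cutoff=20, eps=1e-5):
    """Shift pitch of a linear-magnitude mel-spectrogram (n_mels, n_frames).

    ratio > 1 raises the pitch, ratio < 1 lowers it. No F0 is estimated.
    """
    # 1. Approximate the linear magnitude with the filterbank pseudo-inverse.
    mag = np.maximum(np.linalg.pinv(fb) @ mel, eps)
    # 2. Pseudo-cepstrum: DCT of the log magnitude along the frequency axis.
    cep = dct(np.log(mag), type=2, axis=0, norm="ortho")
    # 3. Move the harmonic (high-quefrency) part without locating the peak.
    cep_shifted = warp_quefrency(cep, ratio, cutoff)
    # 4. Back to magnitude via the IDCT, then re-apply the mel filterbank.
    mag_shifted = np.exp(idct(cep_shifted, type=2, axis=0, norm="ortho"))
    return fb @ mag_shifted
```

For example, with `fb = mel_filterbank(80, 1024, 22050)` and a mel-spectrogram `mel` of shape `(80, n_frames)` computed with the same filterbank, `shift_pitch(mel, fb, ratio=1.2)` returns an array of the same shape that can be passed to a mel-based vocoder. In practice the filterbank must match the one the vocoder was trained with, and any log or normalization applied to the vocoder's input features would need to be undone before and reapplied after this step.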
