A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Guoqiang Hu; Huaning Tan; Ruilai Li

TL;DR

The paper addresses the limited time–frequency resolution of Mel spectrograms in neural TTS caused by Fourier-based processing. It proposes a Mel spectrogram enhancement paradigm that adds an auxiliary wavelet-based task, predicting a finer wavelet spectrogram from the Mel spectrogram decoder output, and optimizes with a multi-task loss $Loss_{Total} = Loss_{Baseline} + Loss_{Wavelet}$. The approach is validated on Tacotron2 (AR) and Fastspeech2 (NAR) using LJ Speech, achieving MOS improvements of $0.14$ and $0.09$, respectively. Results indicate the paradigm enhances spectral detail across architectures, suggesting broad applicability to Mel-based TTS pipelines and motivating extensions to other acoustic representations.
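As a rough illustration of the multi-task objective stated above, the sketch below sums a baseline term and a wavelet term with equal weight. The choice of mean squared error for both terms, the reduction of the baseline loss to a single Mel term, and the names `mse` and `total_loss` are assumptions made for illustration; the summary does not specify them.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two arrays of the same shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def total_loss(mel_pred, mel_target, wavelet_pred, wavelet_target):
    """Return (Loss_Total, Loss_Baseline, Loss_Wavelet) as an unweighted sum."""
    # Assumption: the baseline loss is reduced here to a Mel spectrogram MSE.
    loss_baseline = mse(mel_pred, mel_target)
    # Assumption: the auxiliary wavelet task is also scored with MSE.
    loss_wavelet = mse(wavelet_pred, wavelet_target)
    return loss_baseline + loss_wavelet, loss_baseline, loss_wavelet

# Toy tensors: 80 Mel bins and 10 wavelet scales over 100 frames (hypothetical sizes).
rng = np.random.default_rng(0)
mel_target = rng.normal(size=(80, 100))
wavelet_target = rng.normal(size=(10, 100))
loss, base, wav = total_loss(mel_target + 0.1, mel_target,
                             wavelet_target + 0.2, wavelet_target)
print(f"baseline={base:.4f} wavelet={wav:.4f} total={loss:.4f}")
```

In a real Tacotron2 or Fastspeech2 setup the baseline term would be whatever loss the unmodified model already uses; only the added wavelet term is new.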

Abstract

Acoustic features play an important role in the quality of synthesised speech. The Mel spectrogram is currently the acoustic feature used by most acoustic models. However, the Fourier transform used to compute it loses fine-grained detail, which compromises the clarity of the synthesised speech for abruptly changing signals. To obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). The paradigm introduces an additional task: predicting a more detailed wavelet spectrogram with a network that, like the post-processing network, takes the Mel spectrogram output by the decoder as input. We choose Tacotron2 and Fastspeech2 for experimental validation, covering autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results show that speech synthesised by models using the enhancement paradigm achieves a higher MOS, improving on the baseline models by 0.14 and 0.09, respectively. Because the paradigm succeeds in both architectures, these findings provide some evidence of its generality.
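To make the term "wavelet spectrogram" concrete, here is a minimal sketch of a CWT magnitude map computed by direct convolution. The complex Morlet wavelet, the parameter `w0`, the scale values, and the function names are assumptions chosen for illustration; the abstract does not say which wavelet or scales the authors use.

```python
import numpy as np

def morlet(t, scale, w0=6.0):
    """Complex Morlet wavelet sampled at times t and dilated by `scale`."""
    x = t / scale
    return np.exp(1j * w0 * x) * np.exp(-0.5 * x ** 2) / np.sqrt(scale)

def cwt_magnitude(signal, scales, dt=1.0, w0=6.0):
    """Return |CWT| with shape (len(scales), len(signal))."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    # Wavelet support centred on the middle of the analysis window.
    t = (np.arange(n) - n // 2) * dt
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Correlation with the wavelet = convolution with its reversed conjugate.
        kernel = np.conj(morlet(t, s, w0))[::-1]
        out[i] = np.abs(np.convolve(signal, kernel, mode="same")) * dt
    return out

# A single click at sample 300 stands in for an abruptly changing signal.
click = np.zeros(512)
click[300] = 1.0
spec = cwt_magnitude(click, scales=[2.0, 8.0, 32.0])
for s, row in zip([2.0, 8.0, 32.0], spec):
    # Count samples above half of the row maximum as a crude measure of time spread.
    width = int(np.sum(row > 0.5 * row.max()))
    print(f"scale={s:5.1f} peak_at={int(np.argmax(row))} half_width_samples={width}")
```

Small scales use short wavelets and so localise the click tightly in time, while large scales trade time precision for frequency precision. A fixed-window Fourier analysis cannot vary this trade-off across frequencies, which is the limitation the abstract attributes to the Mel spectrogram.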

Paper Structure (13 sections, 4 equations, 3 figures, 1 table)

Introduction
Related Work
Text-to-Speech
Acoustic Model
Mel Spectrogram
Tacotron2 and Fastspeech2
Analysis of CWT and Fourier Transform
Proposed Approach
CWT-Net
Experiments and Results
Dataset and Preprocessing
Result Analysis
Summary

Figures (3)

Figure 1: Mel Spectrogram Enhancement Paradigm Framework
Figure 2: Tacotron2 using Mel Spectrogram Enhancement Paradigm
Figure 3: Fastspeech2 using Mel Spectrogram Enhancement Paradigm
