Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders
Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda
TL;DR
This work addresses intelligibility enhancement for electrolaryngeal (EL) speech by introducing a robust linguistic encoder that maps EL and typical speech to a unified representation, reducing the speech-type mismatch between pretraining and fine-tuning. The framework comprises recognition, alignment, and synthesis modules, with a loss L_{ASR} = L_{SID}(X) + L_{CTC}(X_{EL}) + L_{ATTN}(X_{EL}) guiding the recognition module to extract purely linguistic information. The alignment module takes bottleneck features (BNFs) as input and uses HuBERT soft features as targets, and the synthesis module uses a diffusion-based decoder conditioned on speaker embeddings and HuBERT features. Compared with mel-based baselines, the method achieves substantial improvements (e.g., a 16% improvement in character error rate (CER) and a 0.83 MOS increase). The combination of robust linguistic encoding, HuBERT features, diffusion-based generation, and parallel VC pretraining yields more intelligible and more natural-sounding speech, advancing practical EL-to-typical voice conversion for real-world communication.
Abstract
We propose a novel framework for electrolaryngeal (EL) speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases various mismatches, such as a speech type mismatch (EL vs. typical) or a speaker mismatch between the datasets used in each stage, can degrade the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech into the same latent space while still extracting accurate linguistic information, creating a unified representation that reduces the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework to reduce the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that, compared to the conventional framework using mel-spectrogram input and output features, the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and a 0.83 improvement in naturalness score.
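
As a point of reference for the intelligibility metric, character error rate is conventionally defined as the Levenshtein (edit) distance between the recognized and reference character sequences, divided by the reference length. The following is a minimal sketch under that standard definition. The function names and example strings are illustrative assumptions and are not taken from the paper's evaluation pipeline, which is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions, and deletions each cost 1.
    """
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(character_error_rate("abcd", "abxd"))
```

Under this definition, a lower value means the recognizer's transcript of the converted speech is closer to the reference text. The value can exceed 1 when the hypothesis contains many insertions.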
