Linguistic Knowledge Transfer Learning for Speech Enhancement
Kuo-Hsuan Hung, Xugang Lu, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Yi Lin, Chii-Wann Lin, Yu Tsao
TL;DR
CMKT addresses the challenge of incorporating linguistic knowledge into speech enhancement (SE) without requiring text input at inference. It introduces a cross-modality transformer that injects LLM-derived linguistic embeddings into SE models during training, together with a misalignment strategy that improves robustness. Training optimizes a combined loss $\mathcal{L}_{Total} = \alpha\mathcal{L}_{MAE} + (1-\alpha)\mathcal{L}_{CMA}$, where $\mathcal{L}_{CMA} = \sum_{t=1}^{T_t} (1 - \cos(z_t, \hat{z}_t))$ is a cosine-distance term summed over $T_t$ positions and $\alpha$ weights the two terms. Experiments on Mandarin AISHELL-1 and English LibriSpeech show consistent SE improvements across multiple architectures and LLMs. The approach remains effective even when textual data is unavailable or only noisy ASR transcriptions are used, which highlights its practicality. By bridging linguistic and acoustic modalities, CMKT offers a scalable way to improve intelligibility and enhancement performance in real-world noisy environments, with potential applicability to other speech tasks.
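As a rough illustration of the combined loss stated above, the following minimal sketch computes $\mathcal{L}_{MAE}$, $\mathcal{L}_{CMA}$ and $\mathcal{L}_{Total}$ on plain Python lists. It is not the authors' implementation. The function names, the flat-list representation of the clean and enhanced signals, the default $\alpha = 0.5$, and the reading of $z_t$ and $\hat{z}_t$ as two sequences of embedding vectors of equal length are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, 1e-12)


def cma_loss(z, z_hat):
    """L_CMA = sum over t of (1 - cos(z_t, z_hat_t))."""
    if len(z) != len(z_hat):
        raise ValueError("z and z_hat must have the same number of positions")
    return sum(1.0 - cosine(z_t, zh_t) for z_t, zh_t in zip(z, z_hat))


def mae_loss(clean, enhanced):
    """Mean absolute error between clean and enhanced values (assumed flat lists)."""
    if len(clean) != len(enhanced):
        raise ValueError("clean and enhanced must have the same length")
    return sum(abs(c - e) for c, e in zip(clean, enhanced)) / len(clean)


def total_loss(clean, enhanced, z, z_hat, alpha=0.5):
    """L_Total = alpha * L_MAE + (1 - alpha) * L_CMA; alpha=0.5 is an assumed default."""
    return alpha * mae_loss(clean, enhanced) + (1.0 - alpha) * cma_loss(z, z_hat)


# Toy example: two positions with 2-dimensional embeddings.
z = [[0.0, 1.0], [1.0, 0.0]]
z_hat = [[0.0, 1.0], [-1.0, 0.0]]  # first position aligned, second opposite
loss = total_loss([0.0, 2.0], [0.0, 4.0], z, z_hat, alpha=0.5)
```

Because each term $1 - \cos(z_t, \hat{z}_t)$ lies in $[0, 2]$, $\mathcal{L}_{CMA}$ grows with the number of positions $T_t$, so a suitable $\alpha$ may depend on sequence length and on the scale of the MAE term.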
Abstract
Linguistic knowledge plays a crucial role in spoken language comprehension, providing essential semantic and syntactic context for speech perception in noisy environments. However, most speech enhancement (SE) methods rely predominantly on acoustic features to learn the mapping between noisy and clean speech, with limited exploration of linguistic integration. Text-informed SE approaches have been investigated, but they often require explicit speech-text alignment or externally provided textual data, which constrains their practicality in real-world scenarios. Using text as input also makes it difficult to align linguistic and acoustic representations because of their inherent differences. In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference. We further introduce a misalignment strategy to improve knowledge transfer: it applies controlled temporal shifts, encouraging the model to learn more robust representations. Experimental evaluations demonstrate that CMKT consistently outperforms baseline models across various SE architectures and LLM embeddings, highlighting its adaptability to different configurations. Results on Mandarin and English datasets confirm its effectiveness across diverse linguistic conditions, further validating its robustness. CMKT also remains effective in scenarios without textual data, underscoring its practicality for real-world applications. By bridging the gap between linguistic and acoustic modalities, CMKT offers a scalable and innovative solution for integrating linguistic knowledge into SE models, leading to substantial improvements in both intelligibility and enhancement performance.
