Linguistic Knowledge Transfer Learning for Speech Enhancement

Kuo-Hsuan Hung, Xugang Lu, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Yi Lin, Chii-Wann Lin, Yu Tsao

TL;DR

CMKT addresses the challenge of incorporating linguistic knowledge into speech enhancement without requiring text input at inference. It introduces a cross-modality transformer that injects LLM-derived linguistic embeddings into SE models during training, augmented by a misalignment strategy to enhance robustness, and optimizes a combined loss $\mathcal{L}_{Total} = \alpha\mathcal{L}_{MAE} + (1-\alpha)\mathcal{L}_{CMA}$ with $\mathcal{L}_{CMA} = \sum_{t=1}^{T_t} (1 - \cos(z_t, \hat{z}_t))$. Experiments on Mandarin AISHELL-1 and English LibriSpeech show consistent SE improvements across multiple architectures and LLMs, and the approach remains effective even when textual data is unavailable or only noisy ASR transcriptions are used, highlighting its practicality. By bridging linguistic and acoustic modalities, CMKT offers a scalable method to improve intelligibility and enhancement performance in real-world noisy environments, with potential applicability to other speech tasks.
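The combined objective above can be sketched numerically. This is a minimal NumPy illustration, not the authors' implementation. It assumes that $z_t$ and $\hat{z}_t$ are the target linguistic embedding and the model's estimate of it at position $t$, and that $\mathcal{L}_{MAE}$ is the mean absolute error between enhanced and clean features. The function names, the `eps` stabilizer, and the default `alpha` are hypothetical.

```python
import numpy as np


def mae_loss(enhanced, clean):
    """Mean absolute error between enhanced and clean features (assumed meaning of L_MAE)."""
    return float(np.mean(np.abs(np.asarray(enhanced) - np.asarray(clean))))


def cma_loss(z, z_hat, eps=1e-8):
    """Sum over positions t of (1 - cos(z_t, z_hat_t)) for (T_t, D) arrays.

    eps is a hypothetical stabilizer that avoids division by zero.
    """
    z = np.asarray(z, dtype=float)
    z_hat = np.asarray(z_hat, dtype=float)
    dot = np.sum(z * z_hat, axis=-1)
    norms = np.linalg.norm(z, axis=-1) * np.linalg.norm(z_hat, axis=-1) + eps
    return float(np.sum(1.0 - dot / norms))


def total_loss(enhanced, clean, z, z_hat, alpha=0.5):
    """L_Total = alpha * L_MAE + (1 - alpha) * L_CMA; alpha=0.5 is an arbitrary default."""
    return alpha * mae_loss(enhanced, clean) + (1.0 - alpha) * cma_loss(z, z_hat)
```

With identical embeddings the alignment term is approximately zero, so the total reduces to the weighted MAE term; orthogonal embeddings contribute 1 per position, and opposite embeddings contribute 2.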

Abstract

Linguistic knowledge plays a crucial role in spoken language comprehension. It provides essential semantic and syntactic context for speech perception in noisy environments. However, most speech enhancement (SE) methods predominantly rely on acoustic features to learn the mapping relationship between noisy and clean speech, with limited exploration of linguistic integration. While text-informed SE approaches have been investigated, they often require explicit speech-text alignment or externally provided textual data, constraining their practicality in real-world scenarios. Additionally, using text as input poses challenges in aligning linguistic and acoustic representations due to their inherent differences. In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference. Furthermore, we introduce a misalignment strategy to improve knowledge transfer. This strategy applies controlled temporal shifts, encouraging the model to learn more robust representations. Experimental evaluations demonstrate that CMKT consistently outperforms baseline models across various SE architectures and LLM embeddings, highlighting its adaptability to different configurations. Additionally, results on Mandarin and English datasets confirm its effectiveness across diverse linguistic conditions, further validating its robustness. Moreover, CMKT remains effective even in scenarios without textual data, underscoring its practicality for real-world applications. By bridging the gap between linguistic and acoustic modalities, CMKT offers a scalable and innovative solution for integrating linguistic knowledge into SE models, leading to substantial improvements in both intelligibility and enhancement performance.

Paper Structure

This paper contains 27 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparative flowchart between conventional SE and the proposed CMKT learning.
  • Figure 2: Architecture of the proposed model. The left branch is the baseline SE model, and the right branch is the text-speech integration module. $\odot$ and $\oplus$ denote element-wise multiplication and element-wise addition, respectively.
  • Figure 3: Illustration of the misalignment strategy. The green, blue, and red dashed lines represent embeddings that are aligned, right-shifted, and left-shifted, respectively, during the cross-modal alignment loss calculation.
  • Figure 4: The attention weighting in the cross-modality module, where the x-axis and y-axis represent the speech and text modalities, respectively. Results for different models under various alignment methods are presented: (a)–(c) correspond to aligned, (d)–(f) to left-shifted, and (g)–(i) to right-shifted alignments.
  • Figure 5: The evaluation scores at different $\alpha$ values (0.1 to 0.9 in increments of 0.2) on the AISHELL-1 dataset. The blue line represents the baseline Conformer model without CMKT.
  • ...and 1 more figure
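The temporal shifts described in the Figure 3 caption can be illustrated with a small NumPy sketch. It is only a guess at the mechanics, since this page does not give the shift sizes or the padding scheme: the function name `shift_embeddings` is hypothetical, and filling vacated positions with zeros is an assumption made for the example.

```python
import numpy as np


def shift_embeddings(z, shift):
    """Shift a (T, D) embedding sequence along the time axis.

    shift > 0 moves the sequence right, shift < 0 moves it left, and
    shift == 0 returns an aligned copy. Vacated positions are zero-filled,
    which is an assumption of this sketch, not a detail from the paper.
    """
    z = np.asarray(z)
    out = np.zeros_like(z)
    num_steps = z.shape[0]
    if shift == 0:
        return z.copy()
    if abs(shift) >= num_steps:
        return out  # everything is shifted out of range
    if shift > 0:
        out[shift:] = z[:num_steps - shift]
    else:
        out[:num_steps + shift] = z[-shift:]
    return out
```

Under these assumptions, the shifted sequence would stand in for the aligned one as the target when the cross-modal alignment loss is computed during training.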