Semantic Communications for Speech Recognition

Zhenzi Weng, Zhijin Qin, Geoffrey Ye Li

TL;DR

This work tackles the inefficiency of conventional systems that transmit raw speech data by introducing semantic communications for speech recognition. It proposes DeepSC-SR, an end-to-end neural transceiver that jointly learns a semantic encoder and a channel encoder to map speech spectra to compact text-related features, so that the receiver can recover the transcription accurately. Training uses a connectionist temporal classification (CTC) loss to directly maximize the likelihood of the transcription given the speech input, while a robust variant maintains performance across diverse channels without retraining. Experimental results on LibriSpeech show that DeepSC-SR outperforms traditional speech- and text-based baselines in character error rate (CER) and word error rate (WER), particularly in low-SNR and dynamic-channel scenarios, which makes it relevant to bandwidth-limited deployments where semantic fidelity matters. The work thus advances semantic communication by integrating text-focused semantic extraction with end-to-end channel coding to improve efficiency and resilience in speech recognition tasks.
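Models trained with a CTC loss emit one label distribution per frame, and a common way to turn those frames into text is best-path (greedy) decoding: take the most likely label in each frame, merge consecutive repeats, then drop the blank label. The sketch below illustrates that general technique only; the function name `ctc_greedy_decode`, the toy alphabet, and the choice of index 0 as the blank are assumptions for illustration, not details taken from the paper.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding (illustrative sketch, not the paper's code).

    frame_probs: list of per-frame probability lists, one entry per label.
    alphabet:    list mapping label index -> character (index `blank` is unused).
    blank:       index of the CTC blank label (assumed to be 0 here).
    """
    # Step 1: pick the most likely label in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_probs]

    # Step 2: merge consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)
```

The blank label is what lets a repeated character survive: a frame sequence whose best path is `a a <blank> a b b` decodes to `aab`, whereas without the blank the two runs of `a` would merge into one.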

Abstract

Traditional communication systems transmit all source data as bits, regardless of the content of the source or the semantic information required by the receiver. However, in some applications, the receiver needs only the part of the source data that carries critical semantic information, which motivates transmitting only the application-related information, especially when bandwidth resources are limited. In this paper, we consider a semantic communication system for speech recognition by designing the transceiver as an end-to-end (E2E) system. In particular, a deep learning (DL)-enabled semantic communication system, named DeepSC-SR, is developed to learn and extract text-related semantic features at the transmitter, which allows the system to transmit much less data than the source speech without performance degradation. Moreover, to adapt the proposed DeepSC-SR to dynamic channel environments, we investigate a robust model that copes with various channel environments without retraining. The simulation results demonstrate that the proposed DeepSC-SR outperforms traditional communication systems in terms of speech recognition metrics, such as character error rate (CER) and word error rate (WER), and is more robust to channel variations, especially in the low signal-to-noise ratio (SNR) regime.
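The two metrics named in the abstract are conventionally defined as the Levenshtein (edit) distance between the reference and the recognized text, divided by the reference length, counted in characters for the character error rate and in whitespace-separated words for the word error rate. The following is a minimal sketch of that standard definition; the function names are illustrative, and the paper's exact evaluation script (for example, any text normalization it applies) is not shown in this summary.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance over characters, normalized by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace-separated words, normalized by reference length."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)
```

Because insertions count as errors, both rates can exceed 1 when the hypothesis is much longer than the reference.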

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Model structure of DL-enabled semantic communication system for speech recognition.
  • Figure 2: The proposed system architecture for semantic communication system for speech recognition.
  • Figure 3: An example of the greedy decoder.
  • Figure 4: The training MSE loss versus epoch.
  • Figure 5: CER score versus SNR for the speech transceiver, the text transceiver, the proposed DeepSC-SR.
  • ...and 1 more figure