Error-Resilient Semantic Communication for Speech Transmission over Packet-Loss Networks

Zhuohang Han, Jincheng Dai, Shengshi Yao, Junyi Wang, Yanlong Li, Kai Niu, Wenjun Xu, Ping Zhang

TL;DR

Glaris presents a standard-compatible semantic communication framework for real-time speech over packet-loss networks by leveraging generative latent priors in a two-stage coding scheme. It combines VQ-VAE–based latent representations with error-resilient transform coding, guided by a hyperprior, and integrates side-information–based PLC and in-band FEC for robust recovery under burst losses. The approach achieves superior rate-distortion performance and perceptual quality compared with both traditional codecs and neural baselines, while maintaining real-time streaming capability. Its tunable redundancy mechanism enables flexible adaptation to diverse channel conditions, offering a practical path toward JSCC-level robustness within separable system architectures.

Abstract

Real-time speech communication over wireless networks remains challenging, as conventional channel protection mechanisms cannot effectively counter packet loss under stringent bandwidth and latency constraints. Semantic communication has emerged as a promising paradigm for enhancing the robustness of speech transmission by means of joint source-channel coding (JSCC). However, its cross-layer design hinders practical deployment because it is incompatible with existing digital communication systems. In such systems, the robustness of speech communication is therefore evaluated primarily by its resilience to packet loss over wireless networks. To address these challenges, we propose Glaris, a generative latent-prior-based resilient speech semantic communication framework that performs resilient speech coding in the generative latent space. Generative latent priors enable high-quality packet loss concealment (PLC) at the receiver side, balancing semantic consistency and reconstruction fidelity. Additionally, an integrated error resilience mechanism is designed to mitigate error propagation and improve the effectiveness of PLC. Compared with traditional packet-level forward error correction (FEC) strategies, our method achieves enhanced robustness over dynamic wireless networks while significantly reducing redundancy overhead. Experimental results on the LibriSpeech dataset demonstrate that Glaris consistently outperforms existing error-resilient codecs, achieving JSCC-level robustness while maintaining seamless compatibility with existing systems, and that it strikes a favorable balance between transmission efficiency and speech reconstruction quality.

Paper Structure

This paper contains 38 sections, 12 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Proposed error-resilient speech communication framework using generative latent priors. Top: Overview of the two-stage coding framework, where a VQ-VAE first learns a generative latent representation that preserves fine-grained acoustic features, and the subsequent transform coding stage operates on the generative latent space for RD optimization and error resilience. Latent prior loss and hyperprior guidance further enhance semantic consistency for compression and error resilience. Bottom left: Proposed side-information-based PLC. Bottom right: Side information reused as in-band FEC.
  • Figure 2: Illustration of the proposed error-resilient latent transform coding with side-information-based PLC, where the hyperprior is compressed by RVQ. A dual-function entropy model utilizes the decoded side information for both entropy decoding and PLC. In error-free transmission, the entropy module predicts the distribution parameters for entropy decoding, whereas under packet loss, the PLC module reconstructs the missing latent frame using the received side information.
  • Figure 3: Illustration of the side-information-based PLC. The decoding path is selected according to whether the current latent $\hat{\bm{y}}$ can be successfully entropy decoded, as reflected by the color of the arrows. To alleviate context loss, the context length of both the entropy model and the transformer $h_s$ is restricted. When the side information $\bm{z}$ is unavailable, a learned mask token serves as its substitute. In the latent $\bm{y}$ space, another learned mask token is added to the predicted $\hat{\bm{y}}$ to represent the confidence of the reconstruction. Through this hierarchical process, the reconstructed latent representation $\hat{\bm{l}}$ integrates side information and inter-frame dependencies.
  • Figure 4: Learning process of the side information. The entropy module predicts the distribution of $\bm{y}$ for entropy coding, while the PLC module reconstructs $\bm{y}$ under MSE supervision. To constrain bitrate and enable rate control, the side information $\bm{z}$ is compressed with RVQ. During training, random masking with learned tokens $\mathcal{M}$ is applied to $\bm{z}$ at a ratio between 0 and 0.1 to improve robustness against missing side information.
  • Figure 5: Magnitude spectrograms (in dB) of an example speech utterance under different PLC strategies. (a) Reconstructed source signal without packet loss. (PESQ = 4.13, PLCMOS = 4.34, STOI = 0.99) (b) Zero filling in source signal $\hat{\bm{x}}$, where the zero-filled positions are masked for visualization. (PESQ = 1.46, PLCMOS = 2.33, STOI = 0.82) (c) Zero filling in encoded signal $\hat{\bm{y}}$. (PESQ = 1.72, PLCMOS = 3.08, STOI = 0.87) (d) Zero filling in side information $\bm{z}^Q$. (PESQ = 1.88, PLCMOS = 3.05, STOI = 0.89) (e) Sampling from the predicted distribution. (PESQ = 2.30, PLCMOS = 3.69, STOI = 0.93) (f) Proposed side-information-based PLC. (PESQ = 3.00, PLCMOS = 4.11, STOI = 0.96) Compared with (b)--(d), (e) and (f) exhibit richer spectral details and higher perceptual scores, demonstrating the effectiveness of leveraging side information for in-band FEC.
  • ...and 8 more figures
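
The captions of Figures 3 and 4 describe two mechanisms that a short sketch can make concrete: random masking of the side information during training, and the receiver's choice between entropy decoding and PLC. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names, the per-frame list representation, the uniform draw of the masking ratio from [0, 0.1], the independent per-frame masking, and the `plc_predict` callback are all hypothetical; in the paper the mask token is learned and the PLC module is a neural network.

```python
import random


def mask_side_info(z, mask_token, rng, max_ratio=0.1):
    """Replace a random subset of side-information frames with a mask token.

    z          -- list of frames, each a list of floats (assumed layout)
    mask_token -- list of floats standing in for the learned token
    rng        -- random.Random instance, passed in for reproducibility
    max_ratio  -- upper bound of the masking ratio (0.1 per the Figure 4 caption)
    """
    # Assumption: the ratio is drawn uniformly once per call.
    ratio = rng.uniform(0.0, max_ratio)
    # Assumption: each frame is masked independently with that probability.
    lost = [rng.random() < ratio for _ in z]
    z_masked = [
        list(mask_token) if is_lost else list(frame)
        for frame, is_lost in zip(z, lost)
    ]
    return z_masked, lost


def select_latent(y_decoded, z_frame, plc_predict, mask_token):
    """Choose the decoding path for one latent frame (cf. Figure 3).

    y_decoded   -- entropy-decoded latent, or None if the packet was lost
    z_frame     -- received side information, or None if unavailable
    plc_predict -- hypothetical callback mapping side information to a latent
    """
    if y_decoded is not None:
        return y_decoded  # error-free path: use the decoded latent directly
    # Packet lost: fall back to PLC, substituting the mask token
    # when the side information is missing as well.
    side = z_frame if z_frame is not None else mask_token
    return plc_predict(side)
```

Passing the random generator explicitly keeps the masking reproducible across training runs, and returning the boolean `lost` list alongside the masked frames makes it easy to restrict a reconstruction loss to the masked positions if desired.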