Large Speech Model Enabled Semantic Communication
Yun Tian, Zhijin Qin, Guocheng Lv, Ye Jin, Kaibin Huang, Zhu Han
TL;DR
This work introduces LargeSC, a framework that enables semantic speech transmission at very low bitrates by representing speech as discrete tokens via a Mimi-based codec and protecting important tokens through an adaptive in-band Unequal Error Protection strategy. A large autoregressive speech model (Moshi) is fine-tuned with LoRA to perform loss concealment and token reconstruction at the receiver, yielding robust, real-time-capable performance with an end-to-end latency of around 460 ms. The system is evaluated on LibriSpeech and Common Voice with metrics including ViSQOL, PLCMOS, and WER, showing substantial bitrate savings and improved perceptual and semantic quality under packet loss compared to AAC, Opus, and SoundStream baselines. The results demonstrate strong generalization, real-time viability, and practical potential for low-bandwidth, loss-prone speech communication, while also outlining computational challenges and avenues for optimization.
Abstract
Existing speech semantic communication systems, mainly based on Joint Source-Channel Coding (JSCC) architectures, have demonstrated impressive performance, but their effectiveness remains limited by model structures designed for particular tasks and datasets. Recent advances indicate that generative large models pre-trained on massive datasets can achieve exceptional performance across diverse downstream tasks with minimal fine-tuning. To exploit the rich semantic knowledge embedded in large models and enable adaptive transmission over lossy channels, we propose a Large Speech Model enabled Semantic Communication (LargeSC) system. Simultaneously achieving adaptive compression and robust transmission over lossy channels remains challenging, requiring trade-offs among compression efficiency, speech quality, and latency. In this work, we employ Mimi as the speech codec, converting speech into discrete tokens compatible with existing network architectures. We propose an adaptive controller module that enables adaptive transmission and in-band Unequal Error Protection (UEP), dynamically adjusting to both speech content and packet loss probability under bandwidth constraints. Additionally, we employ Low-Rank Adaptation (LoRA) to fine-tune the Moshi foundation model for generative recovery of lost speech tokens. Simulation results show that the proposed system supports bandwidths ranging from 550 bps to 2.06 kbps, outperforms conventional baselines in speech quality under high packet loss rates, and achieves an end-to-end latency of approximately 460 ms, thereby demonstrating its potential for real-time deployment.
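
To make the quoted bandwidth range concrete, the following is a minimal arithmetic sketch of how a token-based codec's bitrate scales with the number of tokens sent per frame. It assumes the 12.5 Hz frame rate and 2048-entry codebooks publicly reported for Mimi; neither figure is stated in the abstract. The function name `token_bitrate_bps` and the per-frame token counts are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import math

# Assumptions (not stated in the abstract): Mimi's reported frame rate
# and codebook size.
FRAME_RATE_HZ = 12.5   # token frames per second
CODEBOOK_SIZE = 2048   # entries per codebook, i.e. 11 bits per token


def token_bitrate_bps(tokens_per_frame: int) -> float:
    """Raw payload bitrate when `tokens_per_frame` tokens are sent per frame.

    Tokens may be source tokens or redundant copies added for protection;
    packet headers and other overhead are ignored.
    """
    bits_per_token = math.log2(CODEBOOK_SIZE)
    return FRAME_RATE_HZ * tokens_per_frame * bits_per_token


# Hypothetical per-frame token budgets.
for n in (4, 8, 15):
    print(f"{n:2d} tokens/frame -> {token_bitrate_bps(n):.1f} bps")
```

Under these assumptions, 4 tokens per frame gives 550 bps and 15 tokens per frame gives 2062.5 bps, which is consistent with the range quoted above. How the adaptive controller actually splits that budget between source tokens and in-band redundancy is not specified in the abstract.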
