From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding
Jayeon Yi, Minje Kim
TL;DR
This work tackles phoneme hallucinations (PHs) in ultra-low-bitrate neural speech codecs by introducing language model-driven (LM) losses that inject semantic guidance without changing the codec architecture. It develops two loss families, ASR loss and TTR loss, which leverage pretrained speech-text models (Whisper for the ASR loss; WavLM and BERT for the TTR loss) to steer decoding toward linguistically plausible outputs. Experiments on a HuBERT-HiFi-GAN–based reference codec with LJ Speech show that LM losses boost semantic adherence and reduce PHs, with ASR loss providing the strongest phonetic preservation, while overall quality remains competitive with semantic distillation baselines. The proposed end-to-end losses extend the semantic-acoustic tradeoff and are applicable to any DNN-based codec, offering a practical path to improve semantic fidelity at very low bitrates.
Abstract
"Phoneme hallucinations" (PHs) commonly occur in low-bitrate DNN-based codecs. They arise when the generative decoder attempts to synthesize plausible outputs from excessively compressed tokens that are missing some semantic information. In this work, we propose language model-driven losses (LM losses) and show that they may alleviate PHs better than a semantic distillation (SD) objective in very-low-bitrate settings. The proposed LM losses build upon language models pretrained to associate speech with text. When ground-truth transcripts are unavailable, we propose to modify a popular automatic speech recognition (ASR) model, Whisper, to compare the decoded utterance against the ASR-inferred transcriptions of the input speech. Otherwise, we propose to use the timed-text regularizer (TTR) to compare WavLM representations of the decoded utterance against BERT representations of the ground-truth transcriptions. We test and compare LM losses against an SD objective, using a reference codec whose three-stage training regimen was designed after several popular codecs. Subjective and objective evaluations indicate that LM losses may provide stronger guidance to extract semantic information from self-supervised speech representations, boosting human-perceived semantic adherence while preserving overall output quality. Demo samples, code, and checkpoints are available online.
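To make the transcript-free case concrete, the following is a minimal sketch of one way such a comparison could be scored. It is an illustrative assumption, not the paper's implementation: the abstract does not state the form of the ASR loss, so token-level cross-entropy is assumed here, and hand-written toy logits stand in for Whisper outputs. The names `pseudo_transcript` and `asr_style_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_transcript(logits_per_step):
    """Greedy token ids per step; stands in for the ASR-inferred
    transcript of the *input* speech (assumption: greedy decoding)."""
    return [max(range(len(step)), key=step.__getitem__)
            for step in logits_per_step]


def asr_style_loss(decoded_logits, target_tokens):
    """Mean cross-entropy of the target tokens under the recognizer's
    predictions for the *decoded* utterance (assumed loss form)."""
    if len(decoded_logits) != len(target_tokens):
        raise ValueError("one target token is needed per prediction step")
    nll = 0.0
    for step, token in zip(decoded_logits, target_tokens):
        nll -= math.log(softmax(step)[token])
    return nll / len(target_tokens)


# Toy recognizer logits over a 3-token vocabulary (hypothetical values).
input_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]         # input speech
faithful_logits = [[3.0, 0.5, 0.0], [0.0, 3.0, 0.5]]      # codec output, same tokens
hallucinated_logits = [[3.0, 0.5, 0.0], [0.0, 0.5, 3.0]]  # second token swapped

targets = pseudo_transcript(input_logits)
loss_faithful = asr_style_loss(faithful_logits, targets)
loss_hallucinated = asr_style_loss(hallucinated_logits, targets)
```

In this toy setup the output whose second token drifts away from the input's pseudo-transcript receives the larger loss, which is the kind of signal a codec could be trained to minimize. In an actual training loop the logits would come from a differentiable recognizer so that gradients reach the codec.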
