From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding

Jayeon Yi, Minje Kim

TL;DR

This work tackles phoneme hallucination (PH) in ultra-low-bitrate neural speech codecs by introducing language model-driven (LM) losses that inject semantic guidance without changing the codec architecture. It develops two loss families, ASR loss and TTR loss, which leverage pretrained speech and text models (Whisper for ASR; WavLM and BERT for speech-text alignment) to steer decoding toward outputs that stay faithful to the input's linguistic content. Experiments on a HuBERT-HiFi-GAN-based reference codec with LJ Speech show that LM losses boost semantic adherence and reduce PHs, with ASR loss providing the strongest phonetic preservation, while overall quality remains competitive with the semantic distillation baseline. The proposed end-to-end losses extend the semantic-acoustic tradeoff and are applicable to any DNN-based codec, offering a practical path to improve semantic fidelity at very low bitrates.

Abstract

"Phoneme hallucinations" (PHs) commonly occur in low-bitrate DNN-based codecs. They arise when the generative decoder attempts to synthesize plausible outputs from excessively compressed tokens that are missing some semantic information. In this work, we propose language model-driven losses (LM losses) and show they may alleviate PHs better than a semantic distillation (SD) objective in very-low-bitrate settings. The proposed LM losses build upon language models pretrained to associate speech with text. When ground-truth transcripts are unavailable, we propose to modify a popular automatic speech recognition (ASR) model, Whisper, to compare the decoded utterance against the ASR-inferred transcriptions of the input speech. Otherwise, we propose to use the timed-text regularizer (TTR) to compare WavLM representations of the decoded utterance against BERT representations of the ground-truth transcriptions. We test and compare LM losses against an SD objective, using a reference codec whose three-stage training regimen was designed after several popular codecs. Subjective and objective evaluations indicate that LM losses may provide stronger guidance to extract semantic information from self-supervised speech representations, boosting human-perceived semantic adherence while preserving overall output quality. Demo samples, code, and checkpoints are available online.
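
The transcript-free ASR loss described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the loss is a token-level cross-entropy, that a frozen ASR model yields per-step logits, and that the decoded-speech pass is teacher-forced so both passes have the same number of steps. The function names and the hand-written logits are hypothetical stand-ins for Whisper outputs.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def pseudo_transcript(input_logits: Sequence[Sequence[float]]) -> List[int]:
    """Greedy token ids the frozen ASR model infers from the INPUT speech.

    These act as targets when no ground-truth transcript is available.
    """
    return [max(range(len(step)), key=step.__getitem__) for step in input_logits]


def asr_loss(decoded_logits: Sequence[Sequence[float]],
             target_tokens: Sequence[int]) -> float:
    """Mean cross-entropy (an assumed loss form) between the ASR model's
    predictions on the DECODED speech and the pseudo-transcript tokens."""
    if len(decoded_logits) != len(target_tokens):
        raise ValueError("step count must match target length")
    nll = [-log_softmax(step)[tok]
           for step, tok in zip(decoded_logits, target_tokens)]
    return sum(nll) / len(nll)


# Toy vocabulary of 4 tokens, 3 decoding steps (made-up numbers).
input_logits = [[4.0, 0.0, 0.0, 0.0],
                [0.0, 4.0, 0.0, 0.0],
                [0.0, 0.0, 4.0, 0.0]]
targets = pseudo_transcript(input_logits)

faithful_logits = input_logits            # codec output preserves the tokens
hallucinated_logits = [input_logits[0],
                       [0.0, 0.0, 0.0, 4.0],  # middle token replaced
                       input_logits[2]]

loss_faithful = asr_loss(faithful_logits, targets)
loss_hallucinated = asr_loss(hallucinated_logits, targets)
```

In this toy setting the hallucinated output receives the larger loss, which is the pressure such an objective would put on the codec during training. A real setup would backpropagate through the ASR model into the codec, which this plain-Python sketch does not attempt.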

Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: (Top) The input speech. (Bottom) The 187.5 bps reference codec with no LM losses exhibits phoneme hallucinations (PHs).
  • Figure 2: Architecture and training of our reference codec. Our three-stage training procedure emulates common codec-training setups. Per-stage markers indicate whether a module is updated, frozen, or idle (not participating) in the given stage. For example, a module can be updated in the first stage, frozen in the second, and idle in the third. In the third stage, either an LM loss ($\mathcal{L}_{\text{ASR}}$, $\mathcal{L}_{\text{TTR}}$) or the SD loss ($\mathcal{L}_{\text{HuBERT}}$) is employed in combination with the others.
  • Figure 3: Overall similarity (left) and the semantic 7-point MOS (right) subjective evaluations; mean and 95% confidence intervals. Codecs trained with LM losses show significantly better semantic performance compared to the others.
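
The per-stage module states in the Figure 2 caption (updated, frozen, idle) and the stage-three choice of one semantic loss could be written down as a small schedule table. This is only an illustrative sketch: `module_a`, `module_b`, `L_other`, and both helper functions are hypothetical, and only the first row mirrors the caption's own example. The actual module assignments are not given in this summary.

```python
from enum import Enum
from typing import Dict, List, Sequence, Tuple


class State(Enum):
    UPDATED = "updated"  # parameters receive gradient updates in this stage
    FROZEN = "frozen"    # used in the forward pass, parameters kept fixed
    IDLE = "idle"        # does not participate in this stage


# Hypothetical modules. Row 1 follows the caption's example:
# updated in stage 1, frozen in stage 2, idle in stage 3.
SCHEDULE: Dict[str, Tuple[State, State, State]] = {
    "module_a": (State.UPDATED, State.FROZEN, State.IDLE),
    "module_b": (State.IDLE, State.UPDATED, State.UPDATED),
}

# Loss names taken from the caption's symbols.
SEMANTIC_LOSSES = ("L_ASR", "L_TTR", "L_HuBERT")


def modules_in_state(stage: int, state: State) -> List[str]:
    """Names of modules that are in `state` during the 1-indexed `stage`."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    return [name for name, states in SCHEDULE.items()
            if states[stage - 1] is state]


def stage3_losses(semantic: str, other_losses: Sequence[str]) -> List[str]:
    """Combine exactly one semantic loss (LM or SD) with the remaining losses."""
    if semantic not in SEMANTIC_LOSSES:
        raise ValueError(f"unknown semantic loss: {semantic}")
    return list(other_losses) + [semantic]
```

For example, `modules_in_state(2, State.FROZEN)` lists what a stage-two training loop would exclude from the optimizer, and `stage3_losses("L_TTR", ["L_other"])` selects the TTR variant of the third stage.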