CLAR: CIF-Localized Alignment for Retrieval-Augmented Speech LLM-Based Contextual ASR

Shangkun Huang, Huan Shen, Wei Zou, Yunzhang Chen

Abstract

Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurate hotword localization in full-utterance speech under weak supervision. We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. With length-aware localized matching, CLAR anchors short-entity acoustic cues and reduces representation dilution and attention drift. The retriever is trained with a multi-granularity objective combining global and local segment-level contrastive losses and a CIF quantity constraint. At inference, top-ranked hotwords are injected as contextual prompts for the Speech LLM, improving recognition without shallow fusion. Experiments show that CLAR significantly improves hotword retrieval and reduces both the character error rate (CER) and the biased word error rate (B-WER) relative to strong contextual ASR baselines.

Paper Structure

This paper contains 14 sections, 8 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: CLAR retrieval-augmented contextual ASR: CIF-Localized Alignment retriever for localized hotword matching and prompt-based decoding with a Speech LLM.
  • Figure 2: The CLAR retriever with dual encoders, CIF-based monotonic alignment, and length-aware localized matching for hotword retrieval.
  • Figure 3: Case-study similarity map for hotword localization.
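The contrastive part of the multi-granularity objective mentioned in the abstract pairs each utterance (or entity segment) embedding with its matching text embedding against in-batch negatives. A minimal InfoNCE sketch under that assumption (pure Python for clarity; the function name, temperature, and vector representation are illustrative, not taken from the paper):

```python
import math

def info_nce(speech_embs, text_embs, temperature=0.07):
    """Speech-to-text InfoNCE loss over a batch of paired embeddings.
    Row i of each list is a positive pair; every other row in the batch
    serves as a negative. Embeddings are cosine-normalized first."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(u):
        m = math.sqrt(dot(u, u))
        return [a / m for a in u]

    s = [normalize(u) for u in speech_embs]
    t = [normalize(u) for u in text_embs]
    loss = 0.0
    for i, si in enumerate(s):
        logits = [dot(si, tj) / temperature for tj in t]
        denom = sum(math.exp(l) for l in logits)
        loss += -(logits[i] - math.log(denom))  # cross-entropy on pair i
    return loss / len(s)
```

In CLAR's objective this term would appear twice, once over whole utterances (global) and once over CIF-localized entity segments (local), combined with the quantity constraint on the CIF weights; the weighting between the terms is a training hyperparameter not specified on this page.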