Speech LLMs are Contextual Reasoning Transcribers

Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

Abstract

Despite being extended to accept speech inputs, large language models (LLMs) are difficult to exploit fully in automatic speech recognition (ASR), since the task primarily involves a direct speech-to-text mapping that leaves their rich knowledge and contextual understanding underused. To close this gap, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain in which the LLM first analyzes the input speech and generates a contextual analysis, thereby fully exploiting its generative capabilities. Guided by this contextual reasoning, CoT-ASR then performs more informed speech recognition, completing both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide the transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared with standard LLM-based ASR, CoT-ASR achieves relative reductions of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
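As a rough illustration of the single-pass output format, the snippet below shows how a CoT-ASR generation could be split into its two tagged spans. The example string, the closing-tag syntax, and the `parse_cot_asr` helper are all hypothetical; the abstract and Figure 1 only establish that the contextual analysis and the transcription are emitted sequentially under <CONTEXT> and <TRANSCRIPT> tags.

```python
import re

# Hypothetical single-pass CoT-ASR output: contextual analysis first,
# then the transcription, each delimited by its tag (closing tags assumed).
output = (
    "<CONTEXT>The audio is a quarterly earnings call; expect financial "
    "terms and company names.</CONTEXT>"
    "<TRANSCRIPT>Revenue grew twelve percent year over year.</TRANSCRIPT>"
)

def parse_cot_asr(text: str) -> dict:
    """Split a one-pass CoT-ASR generation into context and transcript."""
    context = re.search(r"<CONTEXT>(.*?)</CONTEXT>", text, re.S)
    transcript = re.search(r"<TRANSCRIPT>(.*?)</TRANSCRIPT>", text, re.S)
    return {
        "context": context.group(1) if context else "",
        "transcript": transcript.group(1) if transcript else "",
    }

print(parse_cot_asr(output)["transcript"])
# -> Revenue grew twelve percent year over year.
```

A plausible realization of the user-guided mode is to pre-fill the <CONTEXT> span with the user-provided context and decode only the <TRANSCRIPT> span.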

Figures (3)

  • Figure 1: Illustration of CoT-ASR. The text prompt is a fixed template for the ASR task. The output sequence comprises the contextual analysis and the transcription, marked by the <CONTEXT> and <TRANSCRIPT> tags, respectively. The reasoning-based contextual analysis is highlighted in red and the subsequent transcription in green; both are produced sequentially in a single pass.
  • Figure 2: Comparison between CoT-ASR and standard LLM-based ASR. During generation, CoT-ASR first performs a contextual reasoning analysis, which then guides the subsequent transcription. This example illustrates how CoT-ASR leverages the rich knowledge of LLMs to ultimately improve transcription quality.
  • Figure 3: Illustration of the CTC-guided Modality Adapter. The linear output layer projects the encoder outputs to the CTC vocabulary size (including the blank token), while a separate linear layer maps the encoder outputs to the LLM hidden dimension. $\bm{\otimes}$ denotes matrix multiplication and $\bm{\oplus}$ denotes addition. The gate is a frame-wise scalar, produced by a linear projection followed by a sigmoid, that modulates the residual contribution (see the sketch below).
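For concreteness, the sketch below implements this caption's dataflow in PyTorch. The class name, argument names, and the exact placement of the gate are assumptions made for illustration; only the overall structure is taken from the caption: a CTC softmax over the encoder outputs, removal of the blank column, matrix multiplication ($\bm{\otimes}$) of the non-blank posteriors with the LLM token embedding table, and addition ($\bm{\oplus}$) of a gated residual projection.

```python
import torch
import torch.nn as nn

class CTCGuidedAdapter(nn.Module):
    """Minimal sketch of the CTC-guided Modality Adapter (Figure 3).

    Names and dimensions are illustrative assumptions; only the dataflow
    follows the figure caption.
    """

    def __init__(self, d_enc: int, d_llm: int, vocab_size: int, blank_id: int = 0):
        super().__init__()
        # Linear output layer: encoder outputs -> CTC vocabulary incl. blank.
        self.ctc_out = nn.Linear(d_enc, vocab_size + 1)
        # Linear layer: encoder outputs -> LLM hidden dimension (residual path).
        self.proj = nn.Linear(d_enc, d_llm)
        # Gate: frame-wise scalar from a linear projection followed by sigmoid.
        self.gate = nn.Linear(d_enc, 1)
        self.blank_id = blank_id

    def forward(self, enc_out: torch.Tensor, llm_embed: torch.Tensor) -> torch.Tensor:
        # enc_out:   (B, T, d_enc) speech encoder outputs
        # llm_embed: (vocab_size, d_llm) LLM token embedding table (no blank row)
        probs = self.ctc_out(enc_out).softmax(dim=-1)            # (B, T, V+1)
        keep = [i for i in range(probs.size(-1)) if i != self.blank_id]
        p_nb = probs[..., keep]                                  # (B, T, V)
        # (x): weight the LLM embeddings by the non-blank CTC posteriors,
        # mapping each frame directly into the LLM's textual latent space.
        weighted = p_nb @ llm_embed                              # (B, T, d_llm)
        # Frame-wise sigmoid gate modulating the residual contribution.
        g = torch.sigmoid(self.gate(enc_out))                    # (B, T, 1)
        # (+): add the gated residual so acoustic detail that the CTC
        # distribution discards is not lost.
        return weighted + g * self.proj(enc_out)
```

Weighting the embedding table by CTC posteriors yields frame-level features that already live in the LLM's textual space, which is presumably how the adapter aligns the two modalities as the abstract claims, while the gated residual keeps a direct acoustic path.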