Speech LLMs are Contextual Reasoning Transcribers

Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

Abstract

Despite being extended to accept speech inputs, large language models (LLMs) are difficult to exploit fully in automatic speech recognition (ASR), since the task primarily involves a direct speech-to-text mapping that leaves their rich knowledge and contextual understanding underused. To close this gap, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain in which the LLM first analyzes the input speech and generates a contextual analysis, thereby fully exploiting its generative capabilities. Guided by this contextual reasoning, CoT-ASR then performs more informed speech recognition, completing both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide the transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared with standard LLM-based ASR, CoT-ASR achieves relative reductions of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
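As a rough illustration of the single-pass output format, the snippet below shows how a CoT-ASR generation could be split into its two tagged spans. The example string, the closing-tag syntax, and the `parse_cot_asr` helper are all hypothetical; the abstract and Figure 1 only establish that the contextual analysis and the transcription are emitted sequentially under <CONTEXT> and <TRANSCRIPT> tags.

```python
import re

# Hypothetical single-pass CoT-ASR output: contextual analysis first,
# then the transcription, each delimited by its tag (closing tags assumed).
output = (
    "<CONTEXT>The audio is a quarterly earnings call; expect financial "
    "terms and company names.</CONTEXT>"
    "<TRANSCRIPT>Revenue grew twelve percent year over year.</TRANSCRIPT>"
)

def parse_cot_asr(text: str) -> dict:
    """Split a one-pass CoT-ASR generation into context and transcript."""
    context = re.search(r"<CONTEXT>(.*?)</CONTEXT>", text, re.S)
    transcript = re.search(r"<TRANSCRIPT>(.*?)</TRANSCRIPT>", text, re.S)
    return {
        "context": context.group(1) if context else "",
        "transcript": transcript.group(1) if transcript else "",
    }

print(parse_cot_asr(output)["transcript"])
# -> Revenue grew twelve percent year over year.
```

A plausible realization of the user-guided mode is to pre-fill the <CONTEXT> span with the user-provided context and decode only the <TRANSCRIPT> span.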

Figures (3)

  • Figure 1: Illustration of CoT-ASR. The text prompt is a fixed template for the ASR task. The output sequence comprises the contextual analysis and the transcription, marked by the <CONTEXT> and <TRANSCRIPT> tags, respectively. The reasoning-based contextual analysis is highlighted in red and the subsequent transcription in green; both are produced sequentially in a single pass.
  • Figure 2: Comparison between CoT-ASR and standard LLM-based ASR. During generation, CoT-ASR first performs a contextual reasoning analysis, which then guides the subsequent transcription. This example illustrates how CoT-ASR leverages the rich knowledge of LLMs to ultimately improve transcription quality.
  • Figure 3: Illustration of the CTC-guided Modality Adapter. The linear output layer projects the encoder outputs to the CTC vocabulary size (including the blank token), while a separate linear layer maps the encoder outputs to the LLM hidden dimension. $\bm{\otimes}$ denotes matrix multiplication and $\bm{\oplus}$ denotes addition. The gate is a frame-wise scalar, produced by a linear projection followed by a sigmoid, that modulates the residual contribution (see the sketch below).
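For concreteness, the sketch below implements this caption's dataflow in PyTorch. The class name, argument names, and the exact placement of the gate are assumptions made for illustration; only the overall structure is taken from the caption: a CTC softmax over the encoder outputs, removal of the blank column, matrix multiplication ($\bm{\otimes}$) of the non-blank posteriors with the LLM token embedding table, and addition ($\bm{\oplus}$) of a gated residual projection.

```python
import torch
import torch.nn as nn

class CTCGuidedAdapter(nn.Module):
    """Minimal sketch of the CTC-guided Modality Adapter (Figure 3).

    Names and dimensions are illustrative assumptions; only the dataflow
    follows the figure caption.
    """

    def __init__(self, d_enc: int, d_llm: int, vocab_size: int, blank_id: int = 0):
        super().__init__()
        # Linear output layer: encoder outputs -> CTC vocabulary incl. blank.
        self.ctc_out = nn.Linear(d_enc, vocab_size + 1)
        # Linear layer: encoder outputs -> LLM hidden dimension (residual path).
        self.proj = nn.Linear(d_enc, d_llm)
        # Gate: frame-wise scalar from a linear projection followed by sigmoid.
        self.gate = nn.Linear(d_enc, 1)
        self.blank_id = blank_id

    def forward(self, enc_out: torch.Tensor, llm_embed: torch.Tensor) -> torch.Tensor:
        # enc_out:   (B, T, d_enc) speech encoder outputs
        # llm_embed: (vocab_size, d_llm) LLM token embedding table (no blank row)
        probs = self.ctc_out(enc_out).softmax(dim=-1)            # (B, T, V+1)
        keep = [i for i in range(probs.size(-1)) if i != self.blank_id]
        p_nb = probs[..., keep]                                  # (B, T, V)
        # (x): weight the LLM embeddings by the non-blank CTC posteriors,
        # mapping each frame directly into the LLM's textual latent space.
        weighted = p_nb @ llm_embed                              # (B, T, d_llm)
        # Frame-wise sigmoid gate modulating the residual contribution.
        g = torch.sigmoid(self.gate(enc_out))                    # (B, T, 1)
        # (+): add the gated residual so acoustic detail that the CTC
        # distribution discards is not lost.
        return weighted + g * self.proj(enc_out)
```

Weighting the embedding table by CTC posteriors yields frame-level features that already live in the LLM's textual space, which is presumably how the adapter aligns the two modalities as the abstract claims, while the gated residual keeps a direct acoustic path.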