Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models

Bolaji Yusuf; Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran

TL;DR

This work aims to reduce end-to-end latency in speech processing through speculative speech recognition (SSR), in which a system transcribes the prefix of an utterance and speculates its suffix before the user finishes speaking. The approach combines a Conformer-Transducer ASR model with an audio-conditioned Transformer LM, using a fixed-length audio prefix and LoRA adapters so that suffix generation accounts for acoustic context and ASR errors. A novel alignment method, AWSED, and a suffix-focused metric, SOWER, support training and evaluation of speculation. Experiments on Librispeech and multi-domain datasets show that audio conditioning yields meaningful gains. The findings suggest that SSR can substantially reduce latency in downstream NLP tasks, especially when the LM is audio-aware and trained with ASR-aware finetuning, and that negative latency is practical provided LM inference is efficient.

Abstract

This paper explores speculative speech recognition (SSR), where we empower conventional automatic speech recognition (ASR) with speculation capabilities, allowing the recognizer to run ahead of audio. We introduce a metric for measuring SSR performance and propose a model that performs SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM). The ASR system transcribes ongoing audio and feeds the resulting transcripts, along with an audio-dependent prefix, to the LM, which speculates likely completions for the transcriptions. We experiment with a variety of ASR datasets, on which we show the efficacy of our method and the feasibility of SSR as a method of reducing ASR latency.

Paper Structure (12 sections, 8 equations, 2 figures, 3 tables)

Introduction
Methods
Baseline speculative ASR with ASR-LM hybrid
Audio-aware speculative ASR
Alignment and finetuning for speculation
Experiments
Metrics
Datasets and model architecture
Speculation systems
Librispeech results
Multi-domain results
Conclusions

Figures (2)

Figure 1: Illustration of the proposed model with trainable parameters in blue. A Conformer-Transducer ASR model decodes the speech into text. Then a language model is prompted with a prefix computed from the Conformer encoder output to predict likely completions for the partial ASR hypothesis.
Figure 2: AWSED procedure for computing the optimal alignment between a hypothesis prefix and a full reference.
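
The idea of aligning a hypothesis prefix against a full reference can be illustrated with a small sketch. This is not the paper's AWSED procedure, whose details are not given on this page; it is a generic word-level edit-distance alignment, written under the assumption that the goal is to find how much of the reference the partial hypothesis has already covered, so that the remaining reference words can serve as the suffix target. The function name `align_prefix_to_reference` and the tie-breaking rule (prefer the longer reference prefix) are illustrative choices, not taken from the source.

```python
def align_prefix_to_reference(hyp_prefix, reference):
    """Find the reference prefix closest to a partial hypothesis.

    Returns (split, distance), where reference[:split] is the reference
    prefix with minimum word-level edit distance to hyp_prefix, and
    reference[split:] is the part left to speculate.
    Illustrative only: not the AWSED algorithm from the paper.
    """
    n, m = len(hyp_prefix), len(reference)
    # prev[j] holds the edit distance between hyp_prefix[:i] and reference[:j].
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            substitute = prev[j - 1] + (hyp_prefix[i - 1] != reference[j - 1])
            drop_hyp_word = prev[j] + 1   # extra word in the hypothesis
            skip_ref_word = curr[j - 1] + 1  # reference word missing from hypothesis
            curr[j] = min(substitute, drop_hyp_word, skip_ref_word)
        prev = curr
    # Assumed tie-break: on equal distance, prefer the longer reference prefix,
    # so a misrecognized word counts as a substitution rather than an insertion.
    split = min(range(m + 1), key=lambda j: (prev[j], -j))
    return split, prev[split]


hyp = "the cat sad".split()
ref = "the cat sat on the mat".split()
split, distance = align_prefix_to_reference(hyp, ref)
suffix_target = ref[split:]  # words the LM would be asked to speculate
```

In this example the misrecognized word "sad" is aligned to "sat", so the suffix target is the remainder of the reference after that word.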
