NLE: Non-autoregressive LLM-based ASR by Transcript Editing

Avihu Dekel, Samuel Thomas, Takashi Fukada, George Saon

TL;DR

NLE is a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing. Its fully parallel prediction makes it suitable for real-time applications.

Abstract

While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction. NLE extracts acoustic embeddings and an initial hypothesis from a pretrained speech encoder, then refines the hypothesis using a bidirectional LLM editor trained with a latent alignment objective. An interleaved padding strategy exploits the identity mapping bias of Transformers, allowing the model to focus on corrections rather than full reconstruction. On the Open ASR leaderboard, NLE++ achieves 5.67% average WER with an RTFx (inverse real-time factor) of 1630. In single-utterance scenarios, NLE achieves a 27x speedup over the AR baseline, making it suitable for real-time applications.
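
To put the reported RTFx in concrete terms, the following minimal sketch assumes the usual definition of RTFx as audio duration divided by processing time. The function names are hypothetical, and the one-hour figure is plain arithmetic on the reported average rather than a measurement from the paper.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Compute time implied by a given RTFx (assumes throughput stays constant)."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# One hour of audio at the reported RTFx of 1630.
seconds_needed = processing_time(3600.0, 1630.0)
print(f"{seconds_needed:.2f} s")  # prints "2.21 s"
```

Leaderboard RTFx is an aggregate throughput figure, so it should not be read as the latency of a single short utterance; the abstract reports the single-utterance case separately as a speedup over the AR baseline.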

Paper Structure (31 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Open ASR leaderboard WER-RTFx tradeoff comparing NLE and NLE++ against top-6 models (as of Feb 2026). Both NLE variants lie on the Pareto frontier (no other model achieves both lower WER and higher RTFx), achieving competitive accuracy with superior inference speed.
  • Figure 2: Overview of NLE architecture. The frozen pretrained CTC encoder produces acoustic embeddings and an initial CTC hypothesis. The hypothesis is tokenized and interleaved with insertion slots ($\epsilon$), then concatenated with the projected speech embeddings. The LoRA-adapted bidirectional LLM editor predicts the edited transcript using a CTC objective. The output can be iteratively re-edited (see the section on multi-step editing).
  • Figure 3: Validation loss over training steps for the ablation study (see the ablation section). NLE (full model) achieves the lowest validation loss, confirming that each design choice contributes positively to overall performance.
  • Figure 4: Insertion, deletion and substitution rates (%) for three conditions: average across all datasets, AMI-SDM, and MLS-PT.
  • Figure 5: Inference time breakdown across the different stages of NLE. The encoder dominates at 66% of total time, with the LLM contributing $\sim$30%, and all remaining stages under 4%.
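
The interleaving and CTC-style decoding described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the `EPS` symbol, the one-slot-per-gap layout, the word-level tokens, and the hand-written editor outputs are all assumptions made for illustration. The collapse step is the standard CTC rule (merge adjacent repeats, then drop blanks), with the insertion slot assumed to double as the blank.

```python
EPS = "<eps>"  # hypothetical symbol for an insertion slot, also used as the CTC blank


def interleave(tokens, slots_per_gap=1):
    """Put insertion slots before, between, and after the hypothesis tokens."""
    out = [EPS] * slots_per_gap
    for tok in tokens:
        out.append(tok)
        out.extend([EPS] * slots_per_gap)
    return out


def ctc_collapse(labels):
    """Standard CTC collapse: merge adjacent repeats, then drop blanks."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev and lab != EPS:
            out.append(lab)
        prev = lab
    return out


hypothesis = ["the", "cat", "sat"]
editor_input = interleave(hypothesis)
# editor_input == [EPS, "the", EPS, "cat", EPS, "sat", EPS]

# Identity mapping: copying the input position by position reproduces the hypothesis.
assert ctc_collapse(editor_input) == hypothesis

# Insertion: a slot is filled with a new token (hand-written editor output).
inserted = [EPS, "the", "black", "cat", EPS, "sat", EPS]
print(ctc_collapse(inserted))  # ['the', 'black', 'cat', 'sat']

# Deletion and substitution: a token becomes blank, another token is replaced.
edited = [EPS, "the", EPS, EPS, EPS, "mat", EPS]
print(ctc_collapse(edited))  # ['the', 'mat']
```

Under these assumptions, an unchanged hypothesis corresponds to copying the input at every position, which is consistent with the identity-mapping argument in the abstract: only the positions that need correcting have to differ from the input. The slots between tokens also keep genuinely repeated words from being merged by the collapse step.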