NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding

Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg

TL;DR

NGPU-LM is a GPU-accelerated, batched n-gram language model built on a trie-based data structure. It enables fast context-biasing during greedy ASR decoding for CTC, transducer, and attention encoder-decoder (AED) models. It scores the full vocabulary at each step and uses CUDA-graph-accelerated kernels, which keeps computational overhead below 7% while closing more than 50% of the accuracy gap between greedy decoding and beam search on out-of-domain data. The approach shows up to 10.6% relative WER improvement, avoids the significant slowdown of beam search, and gives robust results in both high- and low-resource regimes. The implementation is released as open source, offering a practical path to scalable, context-aware decoding in industry-scale ASR systems with lightweight external LMs.

Abstract

Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.

Paper Structure

This paper contains 9 sections, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: 3-gram LM built from the text "the cat sat on the mat". Arrows: black – unigram and bigram transitions; green – 3-gram (highest order) transitions; pink dashed – backoff transitions. Double circles: final states.
  • Figure : Inference of NGPU-LM
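
The structure described in the Figure 1 caption can be illustrated with a small CPU-only sketch over the same text, "the cat sat on the mat". This is not the NGPU-LM implementation: it uses a plain Python dictionary of n-gram counts instead of a GPU trie, and it swaps in a simple count-based "stupid backoff" score (with an assumed discount `alpha`) in place of the log-probabilities and backoff weights of a trained n-gram model. The function names `build_counts`, `score_all`, and `next_state` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def build_counts(tokens, order=3):
    """Count every n-gram of length 1..order in the token sequence."""
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def score_all(counts, vocab, context, total, alpha=0.4):
    """Score every vocabulary word after `context`.

    If the n-gram (context + word) was never seen, drop the oldest
    context word (a backoff transition) and add log(alpha) as a penalty.
    """
    scores = {}
    for word in vocab:
        ctx, penalty = tuple(context), 0.0
        while ctx and counts[ctx + (word,)] == 0:
            ctx = ctx[1:]
            penalty += math.log(alpha)
        hit = counts[ctx + (word,)]
        denom = counts[ctx] if ctx else total
        scores[word] = penalty + math.log(hit / denom) if hit else -math.inf
    return scores


def next_state(counts, context, word, order=3):
    """Return the longest seen suffix of (context + word), at most order-1 words."""
    ctx = (tuple(context) + (word,))[-(order - 1):]
    while ctx and counts[ctx] == 0:
        ctx = ctx[1:]
    return ctx


tokens = "the cat sat on the mat".split()
counts = build_counts(tokens, order=3)
vocab = sorted(set(tokens))
scores = score_all(counts, vocab, ("on", "the"), total=len(tokens))
```

Here `scores` holds one value per vocabulary word for the context "on the": "mat" is scored directly from the 3-gram, "cat" after one backoff step to the bigram "the cat", and "sat" after two backoff steps to its unigram. Scoring every word at once is the property that lets a greedy decoder combine LM scores with the ASR model's output distribution in a single step; NGPU-LM does this in batched form on the GPU.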