Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno

TL;DR

The paper addresses the latency of deploying large end-to-end ASR models by applying extreme encoder frame-rate reduction: multiple funnel transformer layers are stacked inside the conformer encoder. With an encoder output duration of $f_\text{enc} = 2.56$ seconds, reached through the reduction ratio $r_\text{enc}$ of the stacked funnel layers, the authors report large latency reductions (up to an $82\%$ reduction in total decoding latency) with only about $3\%$ WER degradation on a large voice-search task. Decoding uses an alignment-length synchronous strategy with a fixed per-step cost $C_\text{exp}$, which allows batching across utterances and yields substantial throughput gains on hardware accelerators such as TPUs. The study also shows that a richer prediction network context (e.g., LSTM-based) and minimum word error rate (MWER) training can mitigate the accuracy loss at extreme reductions, particularly on tail-word sets. Together, the design choices and ablations balance latency against accuracy and give a practical path to deploying large E2E ASR models in latency-sensitive applications such as voice search. The key relations guiding the approach are $r_\text{enc} = \prod_i s_i$, $T_\text{max} = T'/r_\text{enc}$, and a decoding cost of $C_\text{exp} \cdot (T_\text{max} + U_\text{max})$.
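
To make the decoding-cost relation in the TL;DR concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $T'$ is the number of frames entering the funnel stack, that $U_\text{max}$ is a cap on the number of output labels, and that the division is rounded up; none of these details are stated in the summary. The function names and the example sizes (512 frames, 16 labels) are hypothetical.

```python
import math


def max_alignment_steps(num_input_frames: int, r_enc: int, u_max: int) -> int:
    """Upper bound on decoding steps: T_max + U_max.

    Assumption: T_max = ceil(T' / r_enc), where T' (num_input_frames) is the
    number of frames before funnel reduction. The rounding is a guess.
    """
    t_max = math.ceil(num_input_frames / r_enc)
    return t_max + u_max


def decoding_cost(num_input_frames: int, r_enc: int, u_max: int,
                  c_exp: float = 1.0) -> float:
    """Total cost C_exp * (T_max + U_max) with a fixed per-step cost c_exp."""
    return c_exp * max_alignment_steps(num_input_frames, r_enc, u_max)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    frames, labels = 512, 16
    for r_enc in (8, 64):
        steps = max_alignment_steps(frames, r_enc, labels)
        print(f"r_enc={r_enc}: at most {steps} decoding steps")
```

Under these assumptions, raising $r_\text{enc}$ from 8 to 64 shrinks the step bound from 80 to 24, so the label term $U_\text{max}$ starts to dominate once the encoder output is short.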

Abstract

The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, requires computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames. While similar techniques have been investigated in previous work, we achieve dramatically more reduction than has previously been demonstrated through the use of multiple funnel reduction layers. Through ablations, we study the impact of various architectural choices in the encoder to identify the most effective strategies. We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task, while improving encoder and decoder latencies by 48% and 92% respectively, relative to a strong but computationally expensive baseline.

Paper Structure (9 sections, 2 equations, 1 figure, 6 tables)

Figures (1)

  • Figure 1: The hybrid autoregressive transducer (HAT) (Variani et al., 2020) (left). We replace some of the layers with the funnel reduction variant described in the funnel equation. The encoder structure (right), $(s^2_{15}, s^2_{13}, s^2_{11})$ (see the notation in the experimental setup section), corresponds to an encoder reduction factor $r_\text{enc}=8$ and an encoder output duration $f_\text{enc}=320$ ms.
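
The arithmetic behind the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not code from the paper: it reads each $s^2_k$ entry as a stride-2 funnel layer (the subscript is taken to be a layer index, which the summary does not confirm), and it infers a 40 ms frame duration at the funnel input from the caption's own numbers ($320\ \text{ms} / 8$). The function names are hypothetical.

```python
import math

# Assumption: frame duration entering the funnel stack, inferred from the
# Figure 1 caption as 320 ms / 8 = 40 ms. Not stated directly in the summary.
BASE_FRAME_MS = 40.0


def reduction_factor(strides):
    """r_enc is the product of the per-layer funnel strides."""
    return math.prod(strides)


def encoder_output_duration_ms(strides, base_frame_ms=BASE_FRAME_MS):
    """f_enc: duration of input speech covered by one encoder output frame."""
    return base_frame_ms * reduction_factor(strides)


def required_reduction(target_ms, base_frame_ms=BASE_FRAME_MS):
    """Reduction factor needed to reach a target encoder output duration."""
    return target_ms / base_frame_ms


if __name__ == "__main__":
    figure1_strides = (2, 2, 2)  # three stride-2 funnel layers, as read from the caption
    print(reduction_factor(figure1_strides))            # 8
    print(encoder_output_duration_ms(figure1_strides))  # 320.0
    print(required_reduction(2560.0))                   # 64.0
```

On this reading, the 2.56 s encoder output duration reported in the abstract corresponds to $r_\text{enc} = 64$, which any combination of strides whose product is 64 would produce.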