Breaking Down Power Barriers in On-Device Streaming ASR: Insights and Solutions

Yang Li, Yuan Shangguan, Yuhao Wang, Liangzhen Lai, Ernie Chang, Changsheng Zhao, Yangyang Shi, Vikas Chandra

TL;DR

The paper addresses the power constraints of on-device streaming ASR by analyzing how weight configurations across the three RNN-T components (Encoder, Predictor, and Joiner) affect energy, latency, and accuracy. It finds that memory traffic and invocation frequency dominate power consumption: the Joiner is lightweight but invoked frequently, so it consumes a large share of energy, while encoder size has an exponential impact on accuracy. The proposed component-aware compression strategy is guided by power-to-accuracy sensitivity: it compresses the Joiner first, then the Predictor, and finally the Encoder, using local memory where possible. On LibriSpeech and Public Video, this yields energy reductions of up to 47% and real-time factor (RTF) improvements of up to 29% at comparable accuracy. The results indicate practical savings for mobile and wearable devices and a scalable path toward more capable on-device streaming ASR as hardware memory hierarchies evolve.

Abstract

Power consumption plays a crucial role in on-device streaming speech recognition, significantly influencing the user experience. This study explores how the configuration of weight parameters in speech recognition models affects their overall energy efficiency. We found that the influence of these parameters on power consumption varies depending on factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. Our approach, which adjusts model components based on their specific energy sensitivities, achieves up to 47% lower energy usage while preserving comparable model accuracy and improving real-time performance compared to leading methods.
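
As a rough illustration of why invocation frequency and memory placement matter, the sketch below models each component's weight-traffic power as bytes read per call × calls per second × energy per byte, then ranks components by power saved per unit of accuracy lost. It is not the authors' method or code: the `Component` class, the `wer_cost_per_mb` field, and every number are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One model component; all field values used below are illustrative placeholders."""
    name: str
    weight_mb: float        # weights read per invocation, in MB (assumed)
    calls_per_sec: float    # invocation frequency (assumed)
    pj_per_byte: float      # memory-access energy for where the weights live (assumed)
    wer_cost_per_mb: float  # assumed WER increase per MB of weights removed

    def power_mw(self) -> float:
        """Weight-traffic power: bytes/call * calls/s * energy/byte."""
        bytes_per_sec = self.weight_mb * 1e6 * self.calls_per_sec
        watts = bytes_per_sec * self.pj_per_byte * 1e-12  # pJ/s -> W
        return watts * 1e3                                # W -> mW

    def sensitivity(self) -> float:
        """Power saved per unit of WER increase when 1 MB is removed."""
        saved_mw_per_mb = self.power_mw() / self.weight_mb
        return saved_mw_per_mb / self.wer_cost_per_mb


def compression_order(components):
    """Compress first where each unit of accuracy loss buys the most power."""
    ranked = sorted(components, key=lambda c: c.sensitivity(), reverse=True)
    return [c.name for c in ranked]


# Hypothetical configuration: a large, rarely invoked encoder and a small,
# frequently invoked joiner, all reading weights from the same memory level.
components = [
    Component("encoder", weight_mb=60.0, calls_per_sec=5.0,
              pj_per_byte=100.0, wer_cost_per_mb=0.05),
    Component("predictor", weight_mb=4.0, calls_per_sec=20.0,
              pj_per_byte=100.0, wer_cost_per_mb=0.02),
    Component("joiner", weight_mb=2.0, calls_per_sec=200.0,
              pj_per_byte=100.0, wer_cost_per_mb=0.01),
]

for c in components:
    print(f"{c.name:9s} power={c.power_mw():7.2f} mW  "
          f"sensitivity={c.sensitivity():8.2f}")
print(compression_order(components))
```

With these placeholder values the small joiner draws more weight-traffic power than the much larger encoder, because it is invoked far more often. Lowering `pj_per_byte` for one component (for example, to mimic weights that fit in local memory) shows how memory placement changes the ranking.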

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: A schematic representation for the Transformer-based Neural Transducer.
  • Figure 2: Architecture of mobile and wearable devices.
  • Figure 3: Models trained on LibriSpeech: Model power consumption when compressing an individual component (Encoder, Predictor, or Joiner) while keeping the sizes of the other two components constant.
  • Figure 4: Models trained on LibriSpeech: Word error rate on Test-Clean when compressing an individual component (Encoder, Predictor, or Joiner) while keeping the sizes of the other two components constant.
  • Figure 5: Models trained on LibriSpeech: Word error rate on Test-Other when compressing an individual component (Encoder, Predictor, or Joiner) while keeping the sizes of the other two components constant.
  • ...and 3 more figures