Breaking Down Power Barriers in On-Device Streaming ASR: Insights and Solutions
Yang Li, Yuan Shangguan, Yuhao Wang, Liangzhen Lai, Ernie Chang, Changsheng Zhao, Yangyang Shi, Vikas Chandra
TL;DR
The paper addresses the power constraints of on-device streaming ASR by analyzing how the weight configuration of each RNN Transducer (RNNT) component (Encoder, Predictor, and Joiner) affects energy, latency, and accuracy. It finds that memory traffic and invocation frequency dominate power consumption: the Joiner is lightweight but invoked so often that it consumes a large share of the energy. It also finds that accuracy shows an exponential relationship with encoder size. Based on these findings, the authors propose a component-aware compression strategy guided by each component's power-to-accuracy sensitivity. It compresses the Joiner first, then the Predictor, and finally the Encoder, and places weights in local memory when possible. Experiments on LibriSpeech and Public Video show energy reductions of up to 47% and real-time factor (RTF) improvements of up to 29% at comparable accuracy. The results indicate practical savings for mobile and wearable devices and a path toward more capable on-device streaming ASR as hardware memory hierarchies evolve.
Abstract
Power consumption plays a crucial role in on-device streaming speech recognition, significantly influencing the user experience. This study explores how the configuration of weight parameters in speech recognition models affects their overall energy efficiency. We found that the influence of these parameters on power consumption varies depending on factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. Our approach, which adjusts model components based on their specific energy sensitivities, achieves up to 47% lower energy usage while preserving comparable model accuracy and improving real-time performance compared to leading methods.
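To make the roles of invocation frequency and memory allocation concrete, the following is a minimal sketch of a toy energy model, not the authors' method. It assumes that the energy spent fetching weights scales with weight size, invocation rate, and a per-byte cost that depends on where the weights live. Every number, field name, and function name below is a hypothetical placeholder chosen for illustration and is not a measurement from the paper.

```python
from dataclasses import dataclass

# Hypothetical per-byte access cost in arbitrary units (not measured values).
ENERGY_PER_BYTE = {"local": 1.0, "off_chip": 100.0}


@dataclass
class Component:
    name: str
    weight_bytes: int
    invocations_per_second: float
    memory: str  # "local" or "off_chip"
    # Hypothetical accuracy cost of removing one megabyte of weights.
    wer_increase_per_mb_removed: float


def weight_traffic_energy(c: Component) -> float:
    """Toy estimate: every invocation re-reads all weights from c.memory."""
    return c.weight_bytes * c.invocations_per_second * ENERGY_PER_BYTE[c.memory]


def compression_priority(components: list[Component]) -> list[Component]:
    """Rank components by energy saved per unit of accuracy lost (toy score)."""
    def score(c: Component) -> float:
        energy_saved_per_mb = (
            1_000_000 * c.invocations_per_second * ENERGY_PER_BYTE[c.memory]
        )
        return energy_saved_per_mb / c.wer_increase_per_mb_removed

    return sorted(components, key=score, reverse=True)


# Illustrative configuration only: a large, rarely invoked Encoder and a
# small, frequently invoked Joiner. All values are made up.
COMPONENTS = [
    Component("Encoder", 60_000_000, 5.0, "off_chip", 1.0),
    Component("Predictor", 5_000_000, 10.0, "off_chip", 0.2),
    Component("Joiner", 1_000_000, 50.0, "off_chip", 0.1),
]

if __name__ == "__main__":
    for c in compression_priority(COMPONENTS):
        print(c.name, weight_traffic_energy(c))
```

With these placeholder values, the small Joiner ranks first because each megabyte removed from it saves the most energy for the least accuracy loss. Changing a component's `memory` field to `"local"` lowers both its estimated energy and its priority, which mirrors the point that the effect of weight size depends on memory allocation.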
