Conformer-Based Speech Recognition On Extreme Edge-Computing Devices
Mingbin Xu, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han, Zhihong Lei, Yaqiao Deng, Zhen Huang, Mahesh Krishnamoorthy
TL;DR
The paper tackles the challenge of delivering accurate, streaming Conformer-based end-to-end ASR on resource-constrained edge devices, where fully offline processing preserves user privacy. It combines architectural adaptations (depthwise separable convolution, streaming-friendly chunked attention), memory-aware graph execution, and numerical optimizations suited to hardware limitations: a Mean Absolute Deviation (MAD) pre-normalizer that stabilizes $L_p$-norm layer normalization, and a scaled softmax. Experiments on iPhone XR and Apple Watch Series 7 show a real-time factor (RTF) of 0.19 (more than 5.26× faster than real time), along with substantial energy reductions, while maintaining WER comparable to FP32 baselines. The work also provides a general theory of numerical stabilization that can be applied to other transformer-based, server-free AI tasks on edge hardware.
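The two speed figures above describe the same measurement. Assuming the usual definition of the real-time factor as processing time divided by audio duration (the symbols $T_{\text{proc}}$ and $T_{\text{audio}}$ are introduced here for illustration and do not come from the paper):

$$\text{RTF} = \frac{T_{\text{proc}}}{T_{\text{audio}}}, \qquad \text{speedup} = \frac{1}{\text{RTF}} = \frac{1}{0.19} \approx 5.26$$

Under this definition, one second of audio would take about 0.19 seconds to transcribe.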
Abstract
With increasingly powerful compute capabilities and resources in today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, it is still challenging to implement on-device ASR on resource-constrained devices, such as smartphones, smart wearables, and other smart home automation devices. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations to fit an advanced Conformer-based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. We achieve speech recognition more than 5.26 times faster than real time (0.19 RTF) on smart wearables while minimizing energy consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based server-free AI applications. In addition, we provide a complete theory on optimal pre-normalizers that numerically stabilize layer normalization in any $L_p$-norm using any floating-point precision.
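To illustrate why a pre-normalizer can help, the following is a minimal NumPy sketch of the idea for the ordinary $L_2$ case in half precision. It is not the paper's formulation: the function names, the test input, and the choice to divide by the mean absolute deviation before squaring are assumptions made for illustration. It only exercises overflow in the squaring step, since NumPy's `mean` uses float32 intermediates for float16 input.

```python
import numpy as np

def layer_norm_reference(x, eps=1e-5):
    """Float64 layer normalization, used only as ground truth."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean()
    return centered / np.sqrt((centered * centered).mean() + eps)

def layer_norm_naive_fp16(x, eps=1e-5):
    """Layer normalization with float16 elementwise math and no pre-normalizer."""
    x = np.asarray(x).astype(np.float16)
    centered = x - x.mean()
    with np.errstate(over="ignore"):
        # Squares of large activations exceed the float16 maximum (65504).
        var = (centered * centered).mean()
    return centered / np.sqrt(var + np.float16(eps))

def layer_norm_mad_fp16(x, eps=1e-5):
    """Same computation, but rescaled by the mean absolute deviation first.

    Hypothetical sketch: dividing by the MAD brings values to order 1,
    so the squares stay inside the float16 range. Layer normalization is
    scale-invariant (up to eps), so the result is unchanged in exact math.
    """
    x = np.asarray(x).astype(np.float16)
    centered = x - x.mean()
    mad = np.abs(centered).mean()
    if mad == 0:
        return np.zeros_like(centered)  # constant input normalizes to zeros
    y = centered / mad
    var = (y * y).mean()
    return y / np.sqrt(var + np.float16(eps))

if __name__ == "__main__":
    x = np.linspace(-1000.0, 1000.0, 64)  # large-magnitude activations
    ref = layer_norm_reference(x)
    naive = layer_norm_naive_fp16(x).astype(np.float64)
    stable = layer_norm_mad_fp16(x).astype(np.float64)
    print("naive max abs error:", np.max(np.abs(naive - ref)))
    print("MAD   max abs error:", np.max(np.abs(stable - ref)))
```

In this sketch the naive variant's variance overflows to infinity, so its output collapses, while the pre-normalized variant stays close to the float64 reference. Note that `eps` is applied after rescaling in the MAD variant, so the two are not bit-for-bit equivalent formulations.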
