Conformer-Based Speech Recognition On Extreme Edge-Computing Devices

Mingbin Xu; Alex Jin; Sicheng Wang; Mu Su; Tim Ng; Henry Mason; Shiyi Han; Zhihong Lei; Yaqiao Deng; Zhen Huang; Mahesh Krishnamoorthy

TL;DR

The paper addresses the challenge of running accurate, streaming, Conformer-based end-to-end ASR on resource-constrained edge devices, so that speech can be processed offline and user privacy is preserved. It combines architectural adaptations (Depthwise Separable Convolution, streaming-friendly chunked attention), memory-aware graph execution, and numerical optimizations: a Mean Absolute Deviation ($MAD$) pre-normalizer that stabilizes $L_p$-norm layer normalization, and a scaled softmax suited to hardware limitations. Experiments on iPhone XR and Apple Watch Series 7 reach a real-time factor (RTF) as low as 0.19, that is, about 5.26× faster than real time, along with substantial energy reductions, while keeping WER comparable to FP32 baselines. The work also provides a general theory of numerical stabilization that can be applied to other transformer-based, server-free AI tasks on edge hardware.
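To make the pre-normalizer idea concrete, here is a minimal NumPy sketch of why layer normalization can overflow in FP16 and how dividing by the mean absolute deviation first avoids it. The paper's exact formulation is not given in this summary, so the function names, the `eps` value, and the choice to divide the centered input by its MAD are assumptions made for illustration only.

```python
import numpy as np

def layer_norm_fp16_naive(x, eps=1e-5):
    """L2 layer norm computed entirely in FP16 (illustrative, not the paper's code)."""
    x = x.astype(np.float16)
    centered = x - x.mean(dtype=np.float16)
    with np.errstate(over="ignore"):
        # Squaring values above ~256 exceeds the FP16 maximum (65504).
        var = (centered * centered).mean(dtype=np.float16)
    return centered / np.sqrt(var + np.float16(eps))

def layer_norm_fp16_mad(x, eps=1e-5):
    """Same computation, but the centered input is first divided by its
    mean absolute deviation (assumed form of the pre-normalizer)."""
    x = x.astype(np.float16)
    centered = x - x.mean(dtype=np.float16)
    mad = np.abs(centered).mean(dtype=np.float16)
    mad = np.maximum(mad, np.finfo(np.float16).tiny)  # guard against division by zero
    scaled = centered / mad  # entries are now of order 1, so squares stay small
    var = (scaled * scaled).mean(dtype=np.float16)
    return scaled / np.sqrt(var + np.float16(eps))

def layer_norm_fp32(x, eps=1e-5):
    """FP32 reference for comparison."""
    x = x.astype(np.float32)
    centered = x - x.mean()
    return centered / np.sqrt((centered * centered).mean() + eps)

x = np.array([-600.0, -200.0, 200.0, 600.0])
naive = layer_norm_fp16_naive(x)
stable = layer_norm_fp16_mad(x)
reference = layer_norm_fp32(x)
print(naive, stable, reference)
```

In this sketch the naive FP16 variance overflows to infinity, so the normalized output collapses, while the pre-normalized version stays close to the FP32 reference. Because layer normalization is scale-invariant up to the `eps` term, the rescaling changes the result only through how `eps` compares with the rescaled variance.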

Abstract

With increasingly powerful compute capabilities and resources in today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, it is still challenging to implement on-device ASR on resource-constrained devices, such as smartphones, smart wearables, and other smart home automation devices. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations to fit an advanced Conformer-based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. We achieve speech recognition over 5.26 times faster than real time (0.19 RTF) on smart wearables while minimizing energy consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based server-free AI applications. In addition, we provide a complete theory on optimal pre-normalizers that numerically stabilize layer normalization in any $L_p$-norm using any floating point precision.
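The two headline numbers describe the same measurement: the real-time factor is processing time divided by audio duration, and the speedup over real time is its reciprocal. A short sketch of that arithmetic follows; the helper names and the 1.9 s / 10 s timings are hypothetical and chosen only to reproduce an RTF of 0.19.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded."""
    return processing_seconds / audio_seconds

def speedup_over_real_time(rtf):
    """How many times faster than real time a given RTF corresponds to."""
    return 1.0 / rtf

# Hypothetical timings: 10 s of audio decoded in 1.9 s.
rtf = real_time_factor(1.9, 10.0)
speedup = speedup_over_real_time(rtf)
print(f"RTF = {rtf:.2f}, speedup = {speedup:.2f}x")
```

Any RTF below 1.0 means the recognizer keeps up with incoming audio; lower values leave more headroom for other work on the device.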

Paper Structure (17 sections, 9 equations, 5 figures, 4 tables)

Introduction
Prior Work
Backbone Model
Proposed Optimizations
Depthwise Separable Convolution
Memory-aware Graph Execution
Stability of Layer Normalization
Scaling of Softmax
Experiments and Results
Setup
Performance
Energy
Numeric Stability
Quality
Conclusions
...and 2 more sections

Figures (5)

Figure 1: $bz$, $h$, and $f$ refer to batch size, number of attention heads, and feature dimension, respectively, and $d = f / h$. First, we transposed the input and output of Conformer CTC, expanding the input tensor to the desired shape of $(B, C, 1, S)$. This transformation allowed us to execute most layers on the hardware accelerator, as per Principle 1. Additionally, we made extensive use of split and concatenation operations to improve L2 cache residency (Principle 2). To avoid the undesired memory copies caused by batched matrix multiplication layers, we replaced them with Einstein summation operations (Principle 3).
Figure 2: Real-time factor (RTF) of the original Conformer CTC vs Depthwise Separable Convolution (DWS) architectures. Blue and green bars represent the RTF on CPU and hardware accelerators, respectively. A horizontal line at 0.5 marks the RTF required for ASR to run in real time.
Figure 3: Energy consumption (in joules) for 200 queries of the original Conformer CTC vs Depthwise Separable Convolution (DWS) architectures. Blue and green bars represent the values on CPU and hardware accelerators, respectively. The y-axis is in log scale.
Figure 4: Distribution of the maximum values for vanilla convolution and DWS, in log scale.
Figure 5: Distribution of the maximum value of the LayerNorm input, in log scale.
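The Figure 1 caption describes replacing batched matrix multiplication with Einstein summation. The sketch below shows the equivalence for attention scores in NumPy, assuming a channel-first layout of shape $(bz, h, d, S)$ for queries and keys. The shapes, variable names, and layout are assumptions for illustration, and NumPy does not model the accelerator memory-copy behavior the caption refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
bz, h, d, S = 1, 4, 8, 16  # hypothetical sizes; f = h * d

# Queries and keys kept channel-first: (batch, heads, head_dim, sequence).
q = rng.standard_normal((bz, h, d, S)).astype(np.float32)
k = rng.standard_normal((bz, h, d, S)).astype(np.float32)

# Batched matmul needs the query transposed to (bz, h, S, d) first.
scores_matmul = np.matmul(q.transpose(0, 1, 3, 2), k)

# Einsum states the contraction over d directly, with no explicit transpose step.
scores_einsum = np.einsum("bhdq,bhdk->bhqk", q, k)

print(scores_einsum.shape)
print(np.allclose(scores_matmul, scores_einsum, atol=1e-5))
```

Both forms compute the same $(bz, h, S, S)$ score tensor; the einsum form simply names the contracted axis instead of materializing a transposed operand.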
