Turbocharge Speech Understanding with Pilot Inference

Rongxiang Wang, Felix Xiaozhu Lin

TL;DR

This paper targets fast speech understanding on resource-constrained edge devices by proposing a hybrid on-device/offload pipeline. It introduces late contextualization, pilot inference, and autoregression offramps to balance computation, latency, and offloading across existing speech models and pipelines. The PASU prototype combines beam collapse/termination, CTC leap, and confidence estimation for selective offloading, achieving state-of-the-art accuracy while delivering approximately 2× improvements in end-to-end latency and a 2× reduction in offloading on Arm platforms with 6-8 cores. These results demonstrate a practical, adaptable framework for real-time edge speech understanding.

Abstract

Modern speech understanding (SU) runs a sophisticated pipeline: while ingesting streaming voice input, the pipeline repeatedly executes encoder-decoder deep neural networks; in doing so, it generates tentative outputs (called hypotheses) and periodically scores them. This paper sets out to accelerate SU on resource-constrained edge devices. It takes a hybrid approach: speed up on-device execution, and offload inputs that are beyond the device's capacity. While the approach is well known, we address SU's unique challenges with novel techniques: (1) late contextualization, which executes a model's attentive encoder in parallel with input ingestion; (2) pilot inference, which mitigates the SU pipeline's temporal load imbalance; (3) autoregression offramps, which evaluate offloading decisions based on pilot inferences and hypotheses. Our techniques are compatible with existing speech models, pipelines, and frameworks; they can be applied independently or in combination. Our prototype, called PASU, is tested on Arm platforms with 6-8 cores: it delivers SOTA accuracy, reduces end-to-end latency by 2x, and reduces offloading needs by 2x.
