Learnable Pulse Accumulation for On-Device Speech Recognition: How Much Attention Do You Need?

Yakov Pyotr Shkolnikov

Abstract

Self-attention scales quadratically with sequence length, limiting transformer-based speech models on edge devices. We introduce the Learnable Pulse Accumulator (LPA), an O(n) replacement that substitutes key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions. An MSE diagnostic sweep determines per-layer replacement difficulty and ordering. Replacing 8 of 12 wav2vec2-base layers yields a 10.61% word error rate (WER) on LibriSpeech test-clean, +7.24 percentage points (pp) over the 3.37% baseline, with a 3.27x speedup on 120 s audio on an Apple M4 Pro via an optimized MLX inference path. Cross-domain validation on SepFormer speech enhancement shows that all 16 intra-chunk attention layers can be replaced without collapse, suggesting that the depth wall arises from linguistic computation rather than from an LPA limitation. LPA's near-binary gates at inference enable dense GPU computation with no CPU-GPU synchronization, and all operations map to mobile neural accelerators.

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: (a) Standard self-attention computes an $n{\times}n$ matrix via $QK^\top$. (b) LPA replaces this with three learned gate types that define soft windows over the sequence. Gated accumulation produces per-pulse summaries at $O(nP)$ cost. All operations are accelerator-compatible.
  • Figure 2: WER (%) vs. number of replaced layers across four cumulative configurations (full data in Table \ref{tab:progressive}, Appendix). MSE-ordered replacement yields lower WER at every stage.
  • Figure 3: Speedup vs. FP16 attention baseline on Apple M4 Pro, batch 1. Top: inference time (ms, log scale). Bottom: speedup ratio. FP16 attention is slower than FP32 on MPS (Table \ref{tab:speed}).
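
The gated accumulation named in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes fixed, position-only soft rectangular windows with sigmoid edges, and a gate-weighted mean as the per-pulse summary. The function names `soft_rect_gates` and `pulse_summaries` and the parameters `centers`, `widths`, and `sharpness` are hypothetical, and the content-dependent and periodic gate types are omitted. The point is only that $P$ gates over $n$ frames cost $O(nP)$ gate evaluations, with no $n{\times}n$ matrix formed.

```python
import numpy as np

def soft_rect_gates(n, centers, widths, sharpness=10.0):
    """Return a (P, n) array of soft rectangular windows over positions 0..n-1.

    Assumption: each window is the product of two sigmoid edges; the paper's
    exact gate parameterization is not given in this section.
    """
    t = np.arange(n, dtype=np.float64)[None, :]        # (1, n) positions
    left = (centers - widths / 2.0)[:, None]           # (P, 1) left edges
    right = (centers + widths / 2.0)[:, None]          # (P, 1) right edges
    # Numerically stable sigmoid written with tanh.
    sig = lambda z: 0.5 * (1.0 + np.tanh(0.5 * z))
    return sig(sharpness * (t - left)) * sig(sharpness * (right - t))

def pulse_summaries(x, gates, eps=1e-8):
    """Gate-weighted mean of x (n, d) under each of the P gates -> (P, d)."""
    mass = gates.sum(axis=1, keepdims=True)            # (P, 1) total gate weight
    return (gates @ x) / (mass + eps)                  # O(n * P * d) work

# Toy example: 8 frames, 2 features, 2 pulses covering each half of the sequence.
x = np.arange(16, dtype=np.float64).reshape(8, 2)
gates = soft_rect_gates(8, centers=np.array([1.5, 5.5]), widths=np.array([4.0, 4.0]))
summaries = pulse_summaries(x, gates)
print(np.round(gates, 2))
print(np.round(summaries, 2))
```

With a large `sharpness` the gates are close to 0 or 1, so each summary is approximately the mean of the frames inside its window. This is one way to read the abstract's remark about near-binary gates at inference.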