Sparse Binarization for Fast Keyword Spotting
Jonathan Svirsky, Uri Shaham, Ofir Lindenbaum
TL;DR
The paper tackles fast keyword spotting on resource-constrained edge devices by introducing SparkNet, a model that learns a sparse, binarized input representation and relies on a single linear classifier. The approach uses a lightweight 1D time-channel separable convolution backbone with Gaussian-relaxed binarization and a sparsity loss to focus on informative time-frequency regions, achieving high accuracy with a substantial reduction in multiply-accumulate operations (MACs). Empirical results on Google Speech Commands v1/v2 show competitive accuracy with far fewer MACs and improved robustness to noise compared to state-of-the-art small models. Overall, SparkNet enables privacy-preserving, low-latency KWS on devices with limited power and memory, while ablations and qualitative analysis provide insight into the learned binary gates.
Abstract
With the increasing prevalence of voice-activated devices and applications, keyword spotting (KWS) models enable users to interact with technology hands-free, enhancing convenience and accessibility in various contexts. Deploying KWS models on edge devices, such as smartphones and embedded systems, offers significant benefits for real-time applications, privacy, and bandwidth efficiency. However, these devices often have limited computational power and memory, so neural network models must be optimized for efficiency without significantly compromising their accuracy. To address these challenges, we propose a novel keyword-spotting model based on a sparse input representation followed by a linear classifier. The model is four times faster than the previous state-of-the-art model compatible with edge devices, while achieving better accuracy. We also show that our method is more robust in noisy environments while remaining fast. Our code is available at: https://github.com/jsvir/sparknet.
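
The core idea summarized above, a learned sparse gate over the time-frequency input followed by a single linear classifier, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a stochastic-gate style relaxation, where each gate is a clipped Gaussian-perturbed value and the sparsity term is the expected fraction of open gates. The function names, the noise scale `sigma`, the 40 x 101 feature shape, the 12-class output, and the random stand-in for the backbone's gate parameters are all hypothetical.

```python
import math
import numpy as np

def relaxed_gates(mu, sigma=0.5, rng=None, training=True):
    """Gaussian-relaxed binary gates: clip(mu + noise, 0, 1).

    Assumption: noise is injected only during training; at inference
    the gates are the clipped means.
    """
    if training:
        rng = rng if rng is not None else np.random.default_rng()
        noise = rng.normal(0.0, sigma, size=mu.shape)
    else:
        noise = 0.0
    return np.clip(mu + noise, 0.0, 1.0)

def sparsity_penalty(mu, sigma=0.5):
    """Mean probability that a gate is open, P(mu + noise > 0) = Phi(mu / sigma)."""
    erf = np.vectorize(math.erf)
    phi = 0.5 * (1.0 + erf(mu / (sigma * math.sqrt(2.0))))
    return float(phi.mean())

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 101))   # hypothetical: 40 frequency bands x 101 frames
mu = rng.normal(size=features.shape)    # stand-in for gate parameters predicted by a backbone

gates = relaxed_gates(mu, rng=rng)      # values in [0, 1], many exactly 0 or 1
sparse_input = features * gates         # masked time-frequency representation

# A single linear classifier on the flattened sparse input (12 hypothetical classes).
W = rng.normal(size=(12, features.size)) * 0.01
logits = W @ sparse_input.ravel()

penalty = sparsity_penalty(mu)          # added to the classification loss, scaled by a weight
```

In a sketch like this, entries whose gate is exactly zero contribute nothing to the linear layer, which is where the savings in multiply-accumulate operations would come from; a larger weight on the penalty pushes more gates to zero at some cost in accuracy.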
