Sparse Binarization for Fast Keyword Spotting

Jonathan Svirsky, Uri Shaham, Ofir Lindenbaum

TL;DR

The paper tackles fast keyword spotting on resource-constrained edge devices by introducing SparkNet, a model that learns a sparse, binarized input representation and relies on a single linear classifier. The approach uses a lightweight 1D time-channel separable convolution backbone with Gaussian-relaxed binarization and a sparsity loss to focus on informative time-frequency regions, achieving high accuracy with a substantial reduction in multiply-accumulate operations. Empirical results on Google Speech Commands v1/v2 show competitive accuracy with much lower MACs and improved robustness to noise compared to state-of-the-art small models. Overall, SparkNet enables privacy-preserving, low-latency KWS on devices with limited power and memory, while providing insights via ablations and qualitative analysis into the learned binary gates.

Abstract

With the increasing prevalence of voice-activated devices and applications, keyword spotting (KWS) models enable users to interact with technology hands-free, enhancing convenience and accessibility in various contexts. Deploying KWS models on edge devices, such as smartphones and embedded systems, offers significant benefits for real-time applications, privacy, and bandwidth efficiency. However, these devices often possess limited computational power and memory. This necessitates optimizing neural network models for efficiency without significantly compromising their accuracy. To address these challenges, we propose a novel keyword-spotting model based on sparse input representation followed by a linear classifier. The model is four times faster than the previous state-of-the-art edge device-compatible model with better accuracy. We show that our method is also more robust in noisy environments while being fast. Our code is available at: https://github.com/jsvir/sparknet.

Paper Structure

This paper contains 12 sections, 3 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The proposed SparkNet model learns a binary representation $\boldsymbol{z}$ using the reparameterization trick. Random noise is added to the predicted $\boldsymbol{\mu} \in [-1,1]^{F \times T}$, and an approximately binary $\boldsymbol{z}$ is obtained by centering the values on $0.5$ and clipping them to the interval $[0, 1]$. Only the most informative features for the classification task get positive values in $\boldsymbol{z}$.
  • Figure 2: Five randomly chosen words from five categories, shown in MFCC representation (a) and in the predicted binary representation (b).
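
The gating step described in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustrative reading of the caption, not the authors' implementation: the function name `relaxed_binary_gates`, the Gaussian noise scale `sigma`, and the noise-free evaluation mode are all assumptions made for the example. Only the three steps (add noise to $\boldsymbol{\mu}$, center on $0.5$, clip to $[0, 1]$) come from the caption.

```python
import numpy as np


def relaxed_binary_gates(mu, sigma=0.5, rng=None, training=True):
    """Approximately binary gates z from predicted values mu in [-1, 1].

    Hypothetical helper: `sigma` and the `training` switch are assumptions,
    not values taken from the paper.
    """
    mu = np.asarray(mu, dtype=np.float64)
    if training:
        # Reparameterization trick: the randomness lives in `noise`,
        # so z remains a deterministic function of mu and the noise sample.
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.normal(0.0, sigma, size=mu.shape)
    else:
        # Assumed evaluation behaviour: drop the noise entirely.
        noise = np.zeros_like(mu)
    # Center on 0.5, then clip to [0, 1].
    return np.clip(mu + noise + 0.5, 0.0, 1.0)


# Toy F x T grid of predicted values (frequency bins x time frames).
F, T = 4, 6
rng = np.random.default_rng(0)
mu = rng.uniform(-1.0, 1.0, size=(F, T))

z_train = relaxed_binary_gates(mu, rng=rng)        # noisy gates
z_eval = relaxed_binary_gates(mu, training=False)  # deterministic gates

# Fraction of time-frequency cells that remain open (positive) at evaluation.
active_fraction = float((z_eval > 0).mean())
```

Under this reading, any cell with $\mu \le -0.5$ is clipped to exactly zero once the noise is removed, which is how a representation of this kind can end up sparse; a sparsity penalty during training would push more cells into that region.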