Boosting keyword spotting through on-device learnable user speech characteristics

Cristian Cioflan, Lukas Cavigelli, Luca Benini

TL;DR

The paper tackles the challenge of adapting keyword spotting systems to individual users under TinyML constraints by freezing a pretrained backbone and learning a lightweight user embedding, fused with backbone features at the classifier input. The proposed method enables on-device, few-shot adaptation that captures user-specific speech characteristics, achieving up to a 19% relative reduction in error on the Google Speech Commands 35-class task and substantial energy/memory efficiency. Key contributions include a detailed exploration of embedding fusion strategies (e.g., multiplicative fusion), analysis of training costs across backbones, and demonstration that learning embeddings can outperform backbone-only fine-tuning while requiring orders of magnitude less memory. The approach promises practical speaker-aware KWS on ultra-low-power devices, with memory footprints under 16 kB and per-epoch energy around 13 μJ, making it well-suited for extreme edge deployments.

Abstract

Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline-trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure relative error rate reductions of up to 19%, from 30.1% to 24.3%, on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We also demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 k parameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.
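
The 19% figure quoted in the abstract is consistent with a relative reduction of the error rate, not an absolute one. As a quick check, the arithmetic can be sketched as follows; the function name `relative_error_reduction` is a hypothetical helper introduced here for illustration and does not come from the paper.

```python
def relative_error_reduction(baseline_error, adapted_error):
    """Fraction of the baseline error removed by adaptation."""
    return (baseline_error - adapted_error) / baseline_error


# Error rates in percent, as reported in the abstract.
baseline, adapted = 30.1, 24.3

absolute = baseline - adapted                       # percentage points
relative = relative_error_reduction(baseline, adapted)

print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
# prints "absolute: 5.8 points, relative: 19.3%"
```

The absolute drop is 5.8 percentage points, which is about 19% of the 30.1% baseline error.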

Paper Structure (10 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Architecture overview. The backbone uses the audio recording to produce activations, whereas the embedding layer uses the unique ID of the speaker to learn user features. The activations and the user features are fused and employed by a fully connected classifier to classify the input utterance.
  • Figure 2: Error rate average and standard deviation for the 35-class problem, comparing learning speech characteristics through embeddings against three update methodologies. All methodologies include loss-based early stopping with a patience of 5 epochs.
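
The architecture described in Figure 1 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes multiplicative (elementwise) fusion, a frozen single-layer classifier, a random vector standing in for the frozen backbone's activations, and an all-ones embedding initialisation; the dimensions, learning rate, and every name (`forward`, `embedding_step`, `speaker_0`) are hypothetical. Only the embedding row selected by the speaker ID is updated.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def forward(features, user_vec, weights, bias):
    """Fuse backbone features with the user embedding, then classify."""
    fused = [f * u for f, u in zip(features, user_vec)]  # multiplicative fusion
    logits = [sum(w * z for w, z in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def embedding_step(features, label, user_vec, weights, bias, lr=0.1):
    """One SGD step on cross-entropy that updates only the user embedding."""
    probs = forward(features, user_vec, weights, bias)
    # Gradient of softmax + cross-entropy with respect to the logits.
    dlogits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Backpropagate through the frozen classifier to the fused vector.
    dfused = [sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
              for j in range(len(features))]
    # Fusion is elementwise, so dL/du_j = dL/dfused_j * features_j.
    new_vec = [u - lr * g * f for u, g, f in zip(user_vec, dfused, features)]
    return new_vec, -math.log(probs[label])


random.seed(0)
DIM, CLASSES = 8, 35  # assumed feature size; 35 keyword classes

# Frozen parts (assumption): classifier weights and backbone activations.
weights = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(CLASSES)]
bias = [0.0] * CLASSES
features = [random.gauss(0, 1) for _ in range(DIM)]

# Learnable part: one embedding row per speaker ID.
# All-ones initialisation leaves the backbone features unchanged at first.
embeddings = {"speaker_0": [1.0] * DIM}

label = 3  # hypothetical keyword index for this utterance
losses = []
for _ in range(20):
    embeddings["speaker_0"], loss = embedding_step(
        features, label, embeddings["speaker_0"], weights, bias)
    losses.append(loss)

print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Because the backbone and classifier stay fixed in this sketch, the only trainable state is one `DIM`-sized vector per user, which is the source of the low memory and compute cost the summary attributes to the approach.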