Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors

Manuele Rusci, Francesco Paci, Marco Fariselli, Eric Flamand, Tinne Tuytelaars

TL;DR

This work presents a self-learning pipeline for on-device personalization of keyword spotting (KWS) on ultra-low-power audio sensors. The method assigns pseudo-labels to newly recorded audio based on the similarity between its embedding and a prototype built from a few user recordings, then incrementally fine-tunes a lightweight encoder on the pseudo-labeled samples. It is validated on the public HeySnips and HeySnapdragon datasets and on a real-world HeySnips-REC collection, improving accuracy by up to +19.2% over models pretrained on generic keywords. The always-on labeling task runs in real time at an average power of up to 8.2 mW, and the estimated on-device training energy is 10x lower than the labeling energy when a new utterance is sampled every 6.1 s (DS-CNN-S) or 18.8 s (DS-CNN-M). This enables self-adaptive personalized KWS at the extreme edge, with practical implications for battery-powered IoT devices and wearables; the code is available on GitHub.

Abstract

This paper proposes a self-learning method to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings. By experimenting with multiple KWS models with up to 0.5M parameters on two public datasets, we show an accuracy improvement of up to +19.2% and +16.0% vs. the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real time with an average power cost of up to 8.2 mW. On the same platform, we estimate an energy cost for on-device training 10x lower than the labeling energy if sampling a new utterance every 6.1 s or 18.8 s with a DS-CNN-S or a DS-CNN-M model, respectively. Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
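
The energy comparison in the abstract can be read as simple arithmetic. The sketch below assumes one interpretation: the always-on labeling task draws a constant average power, so its energy over one sampling interval is power times time, and training one new utterance costs a fixed energy. The function name and the numbers in the usage lines are hypothetical illustrations, not values from the paper.

```python
def min_interval_s(train_energy_mj: float, label_power_mw: float,
                   ratio: float = 10.0) -> float:
    """Shortest sampling interval (seconds) at which the labeling energy
    spent over that interval is `ratio` times the training energy.

    Assumes E_label(T) = P_label * T, so the condition
    E_train <= E_label(T) / ratio gives T >= ratio * E_train / P_label.
    Units: mJ / mW = s.
    """
    if label_power_mw <= 0:
        raise ValueError("labeling power must be positive")
    return ratio * train_energy_mj / label_power_mw


# Hypothetical numbers, chosen only to make the arithmetic easy to follow:
# 2 mJ per training sample and 4 mW of average labeling power.
interval = min_interval_s(train_energy_mj=2.0, label_power_mw=4.0)
print(f"{interval:.1f} s")
```

Under this reading, a model with a higher training cost per sample needs a longer sampling interval to reach the same 10x ratio, which matches the direction of the 6.1 s versus 18.8 s figures for the smaller and larger models.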

Paper Structure

This paper contains 23 sections, 9 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Self-Learning framework for On-device Personalized KWS. (1) A first on-device calibration function takes the examples provided by the user and returns the threshold parameters for the labeling task. (2) The labeling task processes the audio signal to detect and store pseudo-labeled samples. (3) Finally, the new dataset is used to incrementally train the DNN feature extractor.
  • Figure 2: Audio sensor node comprising a Vesper microphone, the GAP9 MCU, and an external RAM for the training task. The audio data is continuously transferred into a circular buffer in the on-chip RAM. When the buffer is full, the FC wakes up the Cluster, which processes the data using the general-purpose cores and the convolutional accelerator.
  • Figure 3: Pseudocode of the labeling task running on the GAP9 FC core.
  • Figure 4: Recording setup for the HeySnips-REC dataset. The speaker on the right plays the audio files of the HeySnips dataset. Our audio sensor node (left side) is used for recording.
  • Figure 5: Accuracies on HeySnapdragon (left) and HeySnips (right) when varying the threshold parameters $\tau_L$ and $\tau_H$.
  • ...and 5 more figures
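
The labeling step of Figure 1, with the two thresholds $\tau_L$ and $\tau_H$ of Figure 5, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prototype is taken to be the mean of the user embeddings, similarity is measured with Euclidean distance, a frame closer than $\tau_L$ is labeled positive, a frame farther than $\tau_H$ is labeled negative, and anything in between is discarded. The calibration step that derives the thresholds is not modeled; here they are passed in directly. All function names and the toy 2-D embeddings are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def prototype(embeddings: Sequence[Vector]) -> List[float]:
    """Element-wise mean of the few user-provided keyword embeddings."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance (an assumed metric for this sketch)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pseudo_label(embedding: Vector, proto: Vector,
                 tau_low: float, tau_high: float) -> Optional[int]:
    """Return 1 (keyword), 0 (non-keyword), or None (too uncertain to store)."""
    d = distance(embedding, proto)
    if d < tau_low:
        return 1
    if d > tau_high:
        return 0
    return None


# Toy embeddings standing in for the output of the DNN feature extractor.
user_examples = [[1.0, 0.0], [0.8, 0.2]]
proto = prototype(user_examples)

frames = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
labels = [pseudo_label(f, proto, tau_low=0.2, tau_high=1.0) for f in frames]
print(labels)  # [1, 0, None]
```

Only the frames that receive a label would be stored for the incremental training step; leaving a gap between the two thresholds is one way to keep ambiguous frames out of the new dataset.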