
On-Device Domain Learning for Keyword Spotting on Low-Power Extreme Edge Embedded Systems

Cristian Cioflan, Lukas Cavigelli, Manuele Rusci, Miguel de Prado, Luca Benini

TL;DR

This work presents a fully on-device domain adaptation approach for keyword spotting on ultra-low-power edge devices. By pairing a pretrained, noise-robust backbone with on-site, noise-aware fine-tuning of a small learnable classifier, it achieves accuracy gains of up to 14% in unseen noisy environments while using less than 10 kB of memory. The authors propose resource-aware strategies (partial freezing, data-subset training) and demonstrate the method on the GAP9 platform, completing on-site adaptation in about 14 s with as little as 806 mJ. This enables private, low-power, always-on keyword spotting that adapts to real-world noise. The paper contributes a practical TinyML workflow for on-device learning and domain adaptation, backed by a concrete hardware demonstration.

Abstract

Keyword spotting accuracy degrades when neural networks are exposed to noisy environments. On-site adaptation to previously unseen noise is crucial to recovering accuracy loss, and on-device learning is required to ensure that the adaptation process happens entirely on the edge device. In this work, we propose a fully on-device domain adaptation system achieving up to 14% accuracy gains over already-robust keyword spotting models. We enable on-device learning with less than 10 kB of memory, using only 100 labeled utterances to recover 5% accuracy after adapting to the complex speech noise. We demonstrate that domain adaptation can be achieved on ultra-low-power microcontrollers with as little as 806 mJ in only 14 s on always-on, battery-operated devices.
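To put the reported energy and latency in perspective, the short calculation below derives the average power drawn during one adaptation run. It is only a sketch: it assumes the 806 mJ and 14 s figures describe the same run, and the 100 J energy budget in the usage line is a hypothetical value chosen for illustration, not a number from the paper.

```python
# Figures quoted in the abstract.
ENERGY_MJ = 806.0   # energy for one on-device adaptation, in millijoules
DURATION_S = 14.0   # duration of that adaptation, in seconds

# mJ / s == mW, so the ratio is the average power during adaptation.
avg_power_mw = ENERGY_MJ / DURATION_S


def adaptations_per_budget(budget_j: float) -> int:
    """Number of full adaptation runs that fit in an energy budget (joules)."""
    return int(budget_j / (ENERGY_MJ / 1000.0))


# Hypothetical 100 J budget, used only to show how the helper is called.
runs = adaptations_per_budget(100.0)
print(f"average power: {avg_power_mw:.1f} mW, runs per 100 J: {runs}")
```

The average works out to roughly 58 mW over the adaptation window; this is separate from the power drawn during always-on inference, which the abstract does not quantify.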

Paper Structure (11 sections, 1 equation, 2 figures, 2 tables)

This paper contains 11 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the proposed pipeline. During inference, the on-site recording is processed into cepstral coefficients, which are fed to a frozen backbone; the classifier then predicts the uttered keyword. During training, the on-site recording augments a prerecorded utterance, and the prediction and the ground truth are jointly used to update the classifier parameters.
  • Figure 2: Mitigating resource constraints during on-device domain adaptation (ODDA) under speech noise. Left: storage cost as a function of the model (S, M, or L) and the amount of randomly selected data used. Right: memory cost as a function of the number of updated layers of S, starting with the last linear (fc1) layer, for a batch size of two. The horizontal brown lines mark, from lowest to highest, the baseline accuracy obtained with S, M, and L.
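
The training path described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the "backbone" is a fixed random projection with a ReLU standing in for cepstral feature extraction plus the pretrained network, the utterances and noise are synthetic, the 5 dB SNR is arbitrary, and all sizes and names (`mix_noise`, `frozen_backbone`, `adapt_classifier`) are hypothetical. Only the 100 labeled utterances and the batch size of two are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; only N_UTTERANCES and BATCH_SIZE come from the text.
N_SAMPLES, N_FEATS, N_CLASSES = 64, 16, 4
N_UTTERANCES, BATCH_SIZE = 100, 2


def mix_noise(utterance, noise, snr_db):
    """Augment a prerecorded utterance with noise at a target SNR (dB)."""
    sig_pow = np.mean(utterance ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return utterance + scale * noise


def frozen_backbone(x, w_frozen):
    """Stand-in for feature extraction + frozen backbone (never updated)."""
    return np.maximum(x @ w_frozen, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(feats, labels, w, b):
    probs = softmax(feats @ w + b)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def adapt_classifier(feats, labels, w, b, lr=0.05, epochs=10):
    """Mini-batch SGD on the linear classifier only."""
    w, b = w.copy(), b.copy()
    onehot = np.eye(w.shape[1])[labels]
    for _ in range(epochs):
        order = rng.permutation(len(labels))
        for start in range(0, len(labels), BATCH_SIZE):
            idx = order[start:start + BATCH_SIZE]
            # Gradient of softmax cross-entropy w.r.t. the logits.
            grad = (softmax(feats[idx] @ w + b) - onehot[idx]) / len(idx)
            w -= lr * feats[idx].T @ grad
            b -= lr * grad.sum(axis=0)
    return w, b


# Synthetic "prerecorded" utterances: one template per keyword class.
templates = rng.standard_normal((N_CLASSES, N_SAMPLES))
labels = rng.integers(0, N_CLASSES, size=N_UTTERANCES)
w_frozen = rng.standard_normal((N_SAMPLES, N_FEATS)) / np.sqrt(N_SAMPLES)
w_frozen_before = w_frozen.copy()

# Synthetic "on-site" noise mixed into each utterance (5 dB is arbitrary).
noisy = np.stack([
    mix_noise(templates[y], rng.standard_normal(N_SAMPLES), snr_db=5.0)
    for y in labels
])
feats = frozen_backbone(noisy, w_frozen)

w0, b0 = np.zeros((N_FEATS, N_CLASSES)), np.zeros(N_CLASSES)
loss_before = cross_entropy(feats, labels, w0, b0)
w1, b1 = adapt_classifier(feats, labels, w0, b0)
loss_after = cross_entropy(feats, labels, w1, b1)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Because the backbone is frozen, its features can be computed once per utterance and only the classifier's weights, gradients, and a two-sample batch need to be held during training, which is the kind of saving the right panel of Figure 2 explores when varying the number of updated layers.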