Boosting keyword spotting through on-device learnable user speech characteristics
Cristian Cioflan, Lukas Cavigelli, Luca Benini
TL;DR
The paper adapts keyword spotting (KWS) systems to individual users under TinyML constraints by freezing a pretrained backbone and learning a lightweight user embedding, which is fused with the backbone features at the classifier input. The method enables on-device, few-shot adaptation to user-specific speech characteristics, reducing the error rate by up to 19% (relative) on the 35-class Google Speech Commands task. Key contributions include an exploration of embedding fusion strategies (e.g., multiplicative fusion), an analysis of training costs across backbones, and a demonstration that learning embeddings can outperform backbone-only fine-tuning while requiring orders of magnitude less memory. With memory footprints under 16 kB and per-epoch energy around 13 μJ, the approach makes speaker-aware KWS practical on ultra-low-power devices at the extreme edge.
Abstract
Keyword spotting systems for always-on, TinyML-constrained applications require on-site tuning to boost the accuracy of offline-trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, which are often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture composed of a pretrained backbone and a user-aware embedding that learns the user's speech characteristics. The features generated by the two components are fused and used to classify the input utterance. For domain shifts caused by unseen speakers, we measure relative error rate reductions of up to 19%, from 30.1% to 24.3%, on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. Moreover, we demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.
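The architecture described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tiny random "backbone", the layer sizes, the learning rate, and the choice to keep the classifier frozen alongside the backbone are all assumptions made for illustration, as are the names backbone, forward, and adapt_user_embedding. The sketch shows only the core idea: features from a frozen backbone are fused multiplicatively with a learnable per-user embedding, and on-device adaptation updates that embedding alone.

```python
import numpy as np

rng = np.random.default_rng(0)

INPUT_DIM = 40     # hypothetical size of one flattened input feature vector
FEATURE_DIM = 16   # hypothetical backbone output size
NUM_CLASSES = 35   # 35-class Google Speech Commands task

# Frozen stand-ins for the pretrained backbone and classifier (random here).
W_backbone = rng.normal(size=(INPUT_DIM, FEATURE_DIM)) * 0.1
W_cls = rng.normal(size=(FEATURE_DIM, NUM_CLASSES)) * 0.1

def backbone(x):
    """Frozen feature extractor (a single tanh layer as a placeholder)."""
    return np.tanh(x @ W_backbone)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, user_emb):
    """Multiplicative fusion of backbone features with the user embedding."""
    feats = backbone(x)
    fused = feats * user_emb          # element-wise fusion at the classifier input
    return softmax(fused @ W_cls), feats

def cross_entropy(probs, y):
    return -np.log(probs[np.arange(len(y)), y]).mean()

def adapt_user_embedding(x, y, user_emb, lr=0.5, epochs=50):
    """Gradient descent on the user embedding only; all weights stay frozen."""
    onehot = np.eye(NUM_CLASSES)[y]
    for _ in range(epochs):
        probs, feats = forward(x, user_emb)
        dlogits = (probs - onehot) / len(y)   # gradient of mean cross-entropy
        dfused = dlogits @ W_cls.T            # back through the frozen classifier
        grad = (dfused * feats).sum(axis=0)   # back through the fusion
        user_emb = user_emb - lr * grad
    return user_emb

# Few-shot adaptation on 8 synthetic utterances from one hypothetical user.
x_user = rng.normal(size=(8, INPUT_DIM))
y_user = rng.integers(0, NUM_CLASSES, size=8)

emb_init = np.ones(FEATURE_DIM)       # identity fusion before adaptation
loss_before = cross_entropy(forward(x_user, emb_init)[0], y_user)
emb_user = adapt_user_embedding(x_user, y_user, emb_init)
loss_after = cross_entropy(forward(x_user, emb_user)[0], y_user)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In this sketch the only trainable state is the FEATURE_DIM-sized embedding vector, which is what keeps the update cheap compared with fine-tuning the backbone. Initializing the embedding to ones makes the fusion an identity operation, so the model starts from the behavior of the frozen network.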
