Synaspot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy

Kewei Li, Yinan Zhong, Xiaotao Liang, Tianchi Dai, Shaofei Xue

TL;DR

This work tackles open-vocabulary keyword spotting in streaming, on-device scenarios by introducing Synaspot, a lightweight framework that fuses audio, text, and mixed enrollment representations. The model employs a DFSMN-based audio encoder trained to be speaker-robust, with contrastive alignment across modalities to unify embeddings in a shared space, and a streaming decoder that computes frame-level scores using enrollment representations. Key contributions include multimodal enrollment, an encoder-only streaming decoding paradigm, and comprehensive experiments showing competitive accuracy with a small parameter footprint on LibriPhrase and Mandarin WenetPhrase. The results suggest practical, real-time open-vocabulary KWS with strong cross-modal integration and robustness to enrollment variability, suitable for on-device deployment.

Abstract

Open-vocabulary keyword spotting (KWS) in continuous speech streams holds significant practical value across a wide range of real-world applications. Increasing attention has been paid to the role of different modalities in KWS, and their effectiveness has been acknowledged. However, the increased parameter cost from multimodal integration and the constraints of end-to-end deployment have limited the practical applicability of such models. To address these challenges, we propose a lightweight, streaming multi-modal framework. First, we focus on multimodal enrollment features and reduce speaker-specific (voiceprint) information in the speech enrollment to extract speaker-irrelevant characteristics. Second, we effectively fuse speech and text features. Finally, we introduce a streaming decoding framework that only requires the encoder to extract features, which are then mathematically decoded with our three modal representations. Experiments on LibriPhrase and WenetPhrase demonstrate the performance of our model. Compared to existing streaming approaches, our method achieves better performance with significantly fewer parameters.
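
To make the encoder-only decoding idea concrete, the following is a minimal sketch of frame-level scoring against three enrollment representations. It is not the authors' decoder: the abstract does not specify the scoring rule, so the use of cosine similarity, the averaging of the three scores, the `threshold` value, and all function names here are assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def frame_scores(frame_embedding, enrollments):
    """Score one encoder frame against each enrollment representation.

    `enrollments` maps a modality name (e.g. "audio", "text", "mixed")
    to its embedding; this layout is an assumption of the sketch.
    """
    return {name: cosine(frame_embedding, emb) for name, emb in enrollments.items()}

def stream_detect(frames, enrollments, threshold=0.8):
    """Yield (frame_index, fused_score, fired) as frames arrive.

    Fusion by simple averaging is an assumption, not the paper's rule.
    """
    for i, frame in enumerate(frames):
        scores = frame_scores(frame, enrollments)
        fused = sum(scores.values()) / len(scores)
        yield i, fused, fused >= threshold
```

Because `stream_detect` is a generator that consumes one frame at a time, it only needs the encoder output for the current frame, which is the property that makes this style of decoding suitable for streaming.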

Paper Structure

This paper contains 14 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An overview of Synaspot. In ➀ the training phase, we first learn a speaker-irrelevant audio encoder and explicitly enlarge the inter-phoneme margins; we then obtain text and mixed embeddings and align these modalities in a shared embedding space via contrastive learning. In ➁ the inference and decoding phase, we perform lightweight streaming keyword spotting, assigning scores to each audio frame.
  • Figure 2: Visualization of similarity heatmaps.
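
The contrastive alignment mentioned in the Figure 1 caption can be illustrated with a small sketch. The exact loss used by Synaspot is not given in this excerpt, so the InfoNCE-style objective below is an assumed stand-in; the `temperature` value, the function names, and the restriction to a single audio-to-text direction are likewise assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(audio_embs, text_embs, temperature=0.1):
    """Audio-to-text InfoNCE-style loss over a batch.

    Row i of `audio_embs` and row i of `text_embs` are treated as a
    matched pair; every other row in the batch acts as a negative.
    """
    a = [normalize(v) for v in audio_embs]
    t = [normalize(v) for v in text_embs]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, tj) / temperature for tj in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)
```

Minimizing this quantity pulls matched audio and text embeddings together and pushes mismatched ones apart, which is one common way to obtain a shared embedding space; a mixed-modality embedding could be aligned in the same manner by treating it as a third view.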