Text-Aware Adapter for Few-Shot Keyword Spotting

Youngmoon Jung, Jinyoung Lee, Seungjin Lee, Myunghun Jung, Yong-Hyeok Lee, Hoon-Young Cho

TL;DR

The paper improves few-shot keyword spotting for text-enrolled flexible KWS (TF-KWS) by introducing the text-aware adapter (TA-adapter), which tunes only a small portion of the acoustic encoder while using a fixed text embedding (TE) as the keyword representation. The TA-adapter combines three components: text-conditioned feature modulation (TCFM) via learnable activations, a feature-weight adapter (FW-adapter) that tunes the BN/SE blocks, and a TE-based classifier. On Google Speech Commands V2, it reaches up to about $87.22\%$ average precision (AP) with only about 0.14% extra parameters, outperforming several baselines, including AdaKWS and RPL, and enabling rapid keyword-specific adaptation. The approach makes data-efficient personalization practical for TF-KWS; zero-shot extensions via TTS are mentioned as possible future work.

Abstract

Recent advances in flexible keyword spotting (KWS) with text enrollment allow users to personalize keywords without uttering them during enrollment. However, there is still room for improvement in target keyword performance. In this work, we propose a novel few-shot transfer learning method, called text-aware adapter (TA-adapter), designed to enhance a pre-trained flexible KWS model for specific keywords with limited speech samples. To adapt the acoustic encoder, we leverage a jointly pre-trained text encoder to generate a text embedding that acts as a representative vector for the keyword. By fine-tuning only a small portion of the network while keeping the core components' weights intact, the TA-adapter proves highly efficient for few-shot KWS, enabling a seamless return to the original pre-trained model. In our experiments, the TA-adapter demonstrated significant performance improvements across 35 distinct keywords from the Google Speech Commands V2 dataset, with only a 0.14% increase in the total number of parameters.
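
The idea of tuning a small part of the network while leaving the core weights intact, so that the original model can be recovered at any time, can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter names, the tensor sizes, and the choice to count only newly added entries as a parameter increase are all assumptions made for illustration.

```python
# Illustrative sketch only: all names and sizes here are hypothetical,
# not taken from the paper.

def count_params(params):
    """Total number of scalar values in a name -> list-of-floats mapping."""
    return sum(len(values) for values in params.values())


def apply_adapter(base, adapter=None):
    """Return the effective parameters without modifying `base`.

    Entries in `adapter` override same-named base entries (tuned weights)
    or add new ones (extra parameters). With no adapter, the result equals
    the original pre-trained parameters.
    """
    effective = dict(base)  # shallow copy; the base mapping stays untouched
    if adapter:
        effective.update(adapter)
    return effective


def extra_param_percent(base, adapter):
    """Percentage of parameters the adapter adds on top of the base model.

    Assumption: only entries absent from `base` count as an increase;
    overridden entries replace existing weights one-for-one.
    """
    added = sum(len(v) for name, v in adapter.items() if name not in base)
    return 100.0 * added / count_params(base)


# Hypothetical pre-trained model: a large frozen core plus a small norm layer.
base = {"encoder.conv.weight": [0.1] * 1000, "encoder.bn.scale": [1.0] * 10}

# Hypothetical keyword-specific adapter: one tuned override, one new tensor.
adapter = {"encoder.bn.scale": [1.05] * 10, "activation.coef": [0.0] * 4}

adapted = apply_adapter(base, adapter)  # keyword-specific model
restored = apply_adapter(base)          # adapter dropped: original model
```

Because the adapter is stored separately from the pre-trained weights, dropping it is enough to return to the original model, and several keyword-specific adapters could in principle share one frozen base.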

Paper Structure

This paper contains 10 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overall architecture of text-aware adapter (TA-adapter). $\bm{t}$ and $\bm{x}$ represent input text and speech associated with keyword $k$. The red line indicates text embedding (TE) classifier and text-conditioned feature modulation (TCFM).
  • Figure 2: Comparison between (a) AdaIN-based conditioning and (b) text-conditioned feature modulation (TCFM).
  • Figure 3: Plot of normalized outputs of trained LAFs from G4 and G5 conditioned on TEs extracted from six keywords ('backward', 'happy', 'house', 'bird', 'cat', and 'down'). The plots show that LAFs exhibit varying profiles across different keywords and layers.