SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting

Kumari Nishu, Minsik Cho, Devang Naik

TL;DR

This work shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length, and introduces a subsequence-level matching scheme to learn audio-text relations at a finer granularity.

Abstract

User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, a property that prior work has largely under-leveraged. Our analysis of the keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting). We further introduce a subsequence-level matching scheme to learn audio-text relations at a finer granularity, thus distinguishing similar-sounding keywords more effectively through enhanced context. In SLiCK, the model is trained with a multi-task learning approach using two modules: Matcher (utterance-level matching task, novel subsequence-level matching task) and Encoder (phoneme recognition task). The proposed method improves on the baseline results on the LibriPhrase hard dataset, increasing AUC from $88.52$ to $94.9$ and reducing EER from $18.82$ to $11.1$.
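
To make the length-constrained idea concrete, here is a minimal sketch of padding a keyword's phoneme sequence to a fixed length so that a downstream matcher always sees the same input size. The bound of 25 phonemes comes from the keyword-length figure caption; the function name `pad_phonemes`, the `<pad>` token, the mask layout, and the example phonemes are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

MAX_LEN = 25    # maximum keyword length in phonemes, per the keyword-length figure
PAD = "<pad>"   # hypothetical padding token (not from the paper)


def pad_phonemes(phonemes: List[str], max_len: int = MAX_LEN) -> Tuple[List[str], List[int]]:
    """Pad a phoneme sequence to exactly max_len tokens.

    Returns the padded tokens and a mask with 1 for real phonemes
    and 0 for padding positions.
    """
    if len(phonemes) > max_len:
        # Assumed policy: reject over-long keywords instead of truncating them.
        raise ValueError(f"keyword has {len(phonemes)} phonemes, limit is {max_len}")
    n_pad = max_len - len(phonemes)
    tokens = list(phonemes) + [PAD] * n_pad
    mask = [1] * len(phonemes) + [0] * n_pad
    return tokens, mask


# Illustrative five-phoneme keyword (hypothetical input).
tokens, mask = pad_phonemes(["S", "ER", "V", "AH", "S"])
```

Because every keyword maps to the same fixed-size representation, a model consuming `tokens` and `mask` would not need a separate pooling step over a variable text length. Whether SLiCK pads in exactly this way is not stated in this summary.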

Paper Structure (10 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overall architecture of the proposed method SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting). The model is trained with a multi-task learning approach: Matcher (utterance-level matching ${\cal L}_{utt}$, novel subsequence-level matching ${\cal L}_{ss}$) and Encoder (phoneme recognition ${\cal L}_{CTC}$). We only use the blue parts for inference.
  • Figure 2: Visualization of keyword length (number of phonemes) and its frequency in the datasets. Nearly all keywords are bounded by a maximum length of $25$.
  • Figure 3: Visualization of predictions from our novel subsequence-level matching scheme for the anchor text "service" across three samples: (a) a positive pair, (b) an easy negative pair with "nervous", and (c) a hard negative pair with "surface".
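
The Figure 1 caption names three training objectives: ${\cal L}_{utt}$, ${\cal L}_{ss}$, and ${\cal L}_{CTC}$. This summary does not state how they are combined, so the weighted sum below is only an assumed form of a multi-task objective; the symbol ${\cal L}_{total}$ and the weights $\lambda_{utt}$, $\lambda_{ss}$, $\lambda_{CTC}$ are hypothetical and not taken from the paper:

$$
{\cal L}_{total} = \lambda_{utt}\,{\cal L}_{utt} + \lambda_{ss}\,{\cal L}_{ss} + \lambda_{CTC}\,{\cal L}_{CTC}
$$

Setting all three weights to $1$ reduces this to a plain sum of the two Matcher losses and the Encoder loss.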