Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency

Yu Xi, Haoyu Li, Xiaoyu Gu, Hao Li, Yidi Jiang, Kai Yu

TL;DR

This work tackles wake-word detection with a streaming, non-autoregressive CTC-based framework. It introduces a frame-synchronous streaming decoding algorithm that confines the search to the keyword, avoiding full ASR or WFST graphs, and enhances discrimination with Cross-layer Discrimination Consistency (CDC) that leverages intermediate and final CTC branches. By integrating Intermediate CTC regularization and a CDC-based refinement, the method achieves substantial improvements over ASR and graph-based baselines, including robust performance under noisy and out-of-domain conditions, with low false-alarm rates. The approach is practical, easy to implement, and demonstrates strong potential for deployment on resource-constrained devices.

Abstract

Connectionist Temporal Classification (CTC), a non-autoregressive training criterion, is widely used in online keyword spotting (KWS). However, existing CTC-based KWS decoding strategies either rely on Automatic Speech Recognition (ASR), which performs suboptimally due to its broad search over the acoustic space without keyword-specific optimization, or on KWS-specific decoding graphs, which are complex to implement and maintain. In this work, we propose a streaming decoding algorithm enhanced by Cross-layer Discrimination Consistency (CDC), tailored for CTC-based KWS. Specifically, we introduce a streamlined yet effective decoding algorithm capable of detecting the start of the keyword at any arbitrary position. Furthermore, we leverage discrimination consistency information across layers to better differentiate between positive and false alarm samples. Our experiments on both clean and noisy Hey Snips datasets show that the proposed streaming decoding strategy outperforms ASR-based and graph-based KWS baselines. The CDC-boosted decoding further improves performance, yielding an average absolute recall improvement of 6.8% and a 46.3% relative reduction in the miss rate compared to the graph-based KWS baseline, with a very low false alarm rate of 0.05 per hour.
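
The keyword-confined, frame-synchronous decoding idea described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, only a generic Viterbi-style CTC search restricted to the keyword's token sequence, in which a new hypothesis may start at every frame. The function name `stream_keyword_scores`, the blank index `0`, the toy token ids, and the length-normalized (geometric-mean) confidence are all assumptions made for illustration.

```python
import math

BLANK = 0                 # assumed blank token id
NEG_INF = float("-inf")
EPS = 1e-10               # floor to avoid log(0)

def stream_keyword_scores(frames, keyword):
    """Yield one keyword confidence in [0, 1] per incoming frame.

    frames:  iterable of per-frame CTC posteriors (list indexed by token id).
    keyword: list of non-blank token ids spelling the keyword.
    """
    # CTC-expanded keyword: k1, blank, k2, blank, ..., kn
    labels = []
    for tok in keyword:
        labels += [tok, BLANK]
    labels.pop()

    # Each state keeps (log-score of best path, number of frames on that path).
    prev = [(NEG_INF, 0)] * len(labels)
    for post in frames:
        logp = [math.log(max(p, EPS)) for p in post]
        cur = []
        for s, lab in enumerate(labels):
            if s == 0:
                # A hypothesis may start at any frame. Every continuing path
                # has log-score <= 0, so the fresh start is always the best
                # Viterbi choice for the first token.
                best = (0.0, 0)
            else:
                cands = [prev[s], prev[s - 1]]
                # Standard CTC skip over a blank between two different tokens.
                if s >= 2 and lab != BLANK and lab != labels[s - 2]:
                    cands.append(prev[s - 2])
                best = max(cands, key=lambda c: c[0])
            cur.append((best[0] + logp[lab], best[1] + 1))
        prev = cur
        score, length = cur[-1]
        # Geometric mean of the posteriors along the best complete path.
        yield math.exp(score / length) if score > NEG_INF else 0.0

# Toy posteriors over {blank, token 1, token 2}; the keyword is [1, 2].
toy_frames = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
]
toy_scores = list(stream_keyword_scores(toy_frames, [1, 2]))
print(toy_scores)  # the confidence peaks at frame index 2
```

Because only the keyword's states are tracked, the per-frame cost grows with the keyword length rather than with the vocabulary or a decoding graph. A detector would compare each yielded score against a threshold tuned for the target false alarm rate.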

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: An example of "Hey Snips" illustrates the training and decoding framework. $L$ represents the number of DFSMN layers. The CDC-boosted decoding includes a refinement stage. $\bm{s^{(init)}}$, $\bm{s^{(inter)}}$, $\bm{s^{(cdc)}}$, and $\bm{s^{(refine)}}$ denote the initial CTC decoding scores, ICTC decoding scores, CDC scores between initial and intermediate scores, and the final refined scores, respectively. $f_{CDC}$ represents the function to measure discrimination consistency. $L_{His.}$ and $L_{Fut.}$ refer to the history and future frame numbers used for CDC score computation.
  • Figure 2: The patterns of frame-level decoding scores are shown for positive (upper) and negative (lower) samples. Red boxes highlight changes in CDC scores near activation points. Notably, for positive samples, the scores consistently stay close to 1 around wake-up points, while negative samples show sharp variations.
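
The windowed CDC computation described in the Figure 1 caption can be sketched as follows. The captions do not state what $f_{CDC}$ is, so cosine similarity is used here purely as an assumed stand-in; it is at least compatible with the Figure 2 observation that scores stay close to 1 for positive samples. The function name `cdc_score` and the example score sequences are hypothetical, and the refinement step that produces $\bm{s^{(refine)}}$ is not reproduced because the captions do not specify it.

```python
import math

def cdc_score(s_init, s_inter, t, l_his, l_fut):
    """Consistency between final-layer and intermediate-layer decoding scores
    in a window of l_his history and l_fut future frames around frame t.

    Cosine similarity is an ASSUMED choice for f_CDC, not taken from the source.
    """
    lo = max(0, t - l_his)
    hi = min(len(s_init), t + l_fut + 1)
    a = s_init[lo:hi]
    b = s_inter[lo:hi]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

# Hypothetical per-frame scores around a candidate activation at frame 2.
s_init = [0.1, 0.2, 0.9, 0.8, 0.3]
s_inter_agree = [0.2, 0.4, 1.8, 1.6, 0.6]     # same shape as s_init
s_inter_disagree = [0.9, 0.8, 0.1, 0.1, 0.9]  # opposite trend

print(cdc_score(s_init, s_inter_agree, t=2, l_his=2, l_fut=2))
print(cdc_score(s_init, s_inter_disagree, t=2, l_his=2, l_fut=2))
```

Under this assumption, a candidate whose intermediate and final branches rise and fall together scores near 1, while a candidate on which the two branches disagree scores noticeably lower.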