
Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech

Yu Xi, Baochen Yang, Hao Li, Jiaqi Guo, Kai Yu

TL;DR

This work tackles customizable keyword spotting in continuous speech by introducing CLAD, a contrastive learning framework that jointly leverages audio-text matching and audio-audio discrimination. It uses sliding-window InfoNCE losses and a two-phase training pipeline (frame-level AM pretraining followed by CLAD) to produce robust keyword representations suitable for streaming inference. Results on LibriSpeech and LibriPhrase show that incorporating audio discrimination yields substantial accuracy gains and that the end-to-end CLAD approach offers significant speedups over traditional two-stage KWS systems. The method demonstrates strong performance for continuous KWS and practical potential for on-device deployment due to its compact model size and efficient inference.

Abstract

Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only an audio-text representation matching strategy. However, for KWS in continuous speech, co-articulation and streaming word segmentation can easily yield similar audio patterns for different texts, which may in turn trigger false alarms. To address this issue, we propose a novel CL with Audio Discrimination (CLAD) approach that learns keyword representations with both audio-text matching and audio-audio discrimination ability. Here, an InfoNCE loss that considers both audio-audio and audio-text CL data pairs is applied to each sliding window during training. Evaluations on the open-source LibriPhrase dataset show that the sliding-window-level InfoNCE loss yields performance comparable to previous CL approaches. Furthermore, experiments on the continuous speech dataset LibriSpeech demonstrate that, by incorporating audio discrimination, CLAD achieves a significant performance gain over CL without audio discrimination. Meanwhile, compared to two-stage KWS approaches, end-to-end KWS with CLAD achieves not only better performance but also a significant speed-up.
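To make the per-window loss described in the abstract more concrete, the following is a minimal sketch of an InfoNCE loss that combines an audio-text term with an audio-audio term for one sliding window. It is an illustration under stated assumptions, not the paper's implementation: the cosine similarity, the temperature value, the equal weighting of the two terms, and the names `info_nce` and `window_loss` are all hypothetical choices that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE: -log(exp(s_pos/t) / (exp(s_pos/t) + sum_j exp(s_neg_j/t))).

    The similarity function and temperature are assumptions for this sketch.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def window_loss(audio_emb, text_pos, text_negs, audio_pos, audio_negs,
                temperature=0.1):
    """Loss for one sliding window: audio-text term plus audio-audio term.

    Summing the two terms with equal weight is an assumption of this sketch.
    """
    audio_text = info_nce(audio_emb, text_pos, text_negs, temperature)
    audio_audio = info_nce(audio_emb, audio_pos, audio_negs, temperature)
    return audio_text + audio_audio


# Toy usage with 2-dimensional embeddings (values are made up).
window = [1.0, 0.0]
loss_matched = window_loss(window, [1.0, 0.0], [[0.0, 1.0]],
                           [0.9, 0.1], [[0.1, 0.9]])
loss_mismatched = window_loss(window, [0.0, 1.0], [[1.0, 0.0]],
                              [0.1, 0.9], [[0.9, 0.1]])
print(loss_matched < loss_mismatched)
```

In this toy example the window embedding that is close to its positive text and audio embeddings receives a lower loss than the one whose positives and negatives are swapped, which is the behavior the contrastive objective is meant to encourage. In a streaming setting such a loss would be evaluated per window as the window slides over the utterance.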

Paper Structure

This paper contains 17 sections, 4 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The overview of the whole framework.
  • Figure 2: Results of models trained by audio-audio and audio-text pairs or only audio-text pairs. "Clean" means the results of the test-clean dataset, and "Other" means the results of the test-other dataset. "aa" denotes CLAD, while "at" denotes CL without AD.