Table of Contents
Fetching ...

PCOV-KWS: Multi-task Learning for Personalized Customizable Open Vocabulary Keyword Spotting

Jianan Pan, Kejie Huang

Abstract

As advancements in technologies like Internet of Things (IoT), Automatic Speech Recognition (ASR), Speaker Verification (SV), and Text-to-Speech (TTS) lead to increased usage of intelligent voice assistants, the demand for privacy and personalization has escalated. In this paper, we introduce a multi-task learning framework for personalized, customizable open-vocabulary Keyword Spotting (PCOV-KWS). This framework employs a lightweight network to simultaneously perform Keyword Spotting (KWS) and SV to address personalized KWS requirements. We have integrated a training criterion distinct from softmax-based loss, transforming multi-class classification into multiple binary classifications, which eliminates inter-category competition, while an optimization strategy for multi-task loss weighting is employed during training. We evaluated our PCOV-KWS system in multiple datasets, demonstrating that it outperforms the baselines in evaluation results, while also requiring fewer parameters and lower computational resources.

PCOV-KWS: Multi-task Learning for Personalized Customizable Open Vocabulary Keyword Spotting

Abstract

As advancements in technologies like Internet of Things (IoT), Automatic Speech Recognition (ASR), Speaker Verification (SV), and Text-to-Speech (TTS) lead to increased usage of intelligent voice assistants, the demand for privacy and personalization has escalated. In this paper, we introduce a multi-task learning framework for personalized, customizable open-vocabulary Keyword Spotting (PCOV-KWS). This framework employs a lightweight network to simultaneously perform Keyword Spotting (KWS) and SV to address personalized KWS requirements. We have integrated a training criterion distinct from softmax-based loss, transforming multi-class classification into multiple binary classifications, which eliminates inter-category competition, while an optimization strategy for multi-task loss weighting is employed during training. We evaluated our PCOV-KWS system in multiple datasets, demonstrating that it outperforms the baselines in evaluation results, while also requiring fewer parameters and lower computational resources.
Paper Structure (18 sections, 10 equations, 3 figures, 4 tables)

This paper contains 18 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Proposed architecture of PCOV-KWS: The architecture comprises an audio encoder, which includes a shared encoder and two linear sub-encoders for KWS and SV respectively and cosine classifiers integrated with SphereFace 2 for metric learning.
  • Figure 2: Performance Analysis on Training Strategy
  • Figure 3: Evaluation results according to the number of words in a LibriPhrase evaluation set.