Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

Satyam Kumar, Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

TL;DR

Personalized Voice Activity Detection (PVAD) aims to detect speech from a target speaker in multi-user environments with real-time constraints. The paper systematically compares five PVAD fusion variants—DSC, EF, LF, CLF, and DCLF—using LibriSpeech with MUSAN augmentation and enrollment-based d-vectors, evaluating frame- and utterance-level accuracy, latency, and user-level consistency. Key findings show that lightweight end-to-end PVAD models outperform the DSC baseline, with FiLM-based CLF offering the best frame-level latency and utterance-level EER, while DCLF excels in utterance-level detection accuracy; both CLF and DCLF achieve substantial model-size reductions, enabling on-device deployment. These results demonstrate PVAD’s practicality for real-world downstream tasks and highlight the value of dynamic embeddings for robust, low-latency speaker-specific speech detection across devices and scenarios.

Abstract

Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive approach to assess PVAD systems, incorporating various performance metrics such as frame-level and utterance-level error rates, detection latency and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications using a comprehensive set of metrics.

Paper Structure

This paper contains 24 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Pictorial depiction of different fusion strategies for personalization in Voice Activity Detection Systems.
  • Figure 2: Impact of audio duration on detection accuracy.
  • Figure 3: User-level analysis of a) detection latency, b) accuracy of target speaker presence.