The Reasonable Effectiveness of Speaker Embeddings for Violence Detection

Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

TL;DR

The paper addresses the challenge of detecting violence from audio under compute constraints, where large self-supervised (SSL) pre-trained models (PTMs) are impractical. It compares SSL PTMs with compact speaker-recognition embeddings (x-vector, ECAPA) as feature extractors, using Random Forest (RF) and SVM classifiers on a violence vs. non-violence audio dataset with 5-fold cross-validation. The results show that speaker-recognition embeddings, especially x-vector with Random Forest, perform best and reach state-of-the-art levels on the evaluated data. Additionally, the authors present a deployment-ready application with a React/Flask interface that takes about 1 second to run inference on 1 minute of audio, demonstrating practical real-time violence detection in surveillance contexts without relying on visual data.

Abstract

In this paper, we focus on audio violence detection (AVD). AVD is necessary for several reasons, especially for maintaining safety, preventing harm, and ensuring security in various environments. This calls for accurate AVD systems. As in many related audio processing applications, the most common approach to improving performance is to leverage self-supervised (SSL) pre-trained models (PTMs). However, these SSL models are very large, with millions of parameters, which can hinder real-world deployment, especially in compute-constrained environments. To resolve this, we propose using speaker recognition models, which are much smaller than the SSL models. Through experiments with speaker recognition model embeddings and SVM and Random Forest classifiers, we show that speaker recognition model embeddings perform best in comparison to state-of-the-art (SOTA) SSL models and achieve SOTA results.
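
The evaluation setup summarized above (fixed embeddings fed to Random Forest and SVM classifiers, scored with 5-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the embeddings are random placeholders rather than real x-vector or ECAPA outputs, and the 512-dimensional size, the clip count, the hyperparameters, and the helper names `make_placeholder_embeddings` and `evaluate` are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_placeholder_embeddings(n_clips=200, dim=512, seed=0):
    """Return synthetic (embeddings, labels) standing in for real speaker embeddings.

    Hypothetical helper: in a real setup each row would be the embedding of one
    audio clip from a speaker-recognition model, and each label would mark the
    clip as non-violence (0) or violence (1).
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_clips // 2)
    embeddings = rng.normal(size=(n_clips, dim))
    # Shift a few dimensions for class 1 so the toy problem has some signal.
    embeddings[labels == 1, :8] += 1.5
    return embeddings, labels


def evaluate(embeddings, labels, seed=0):
    """Score a Random Forest and an SVM with stratified 5-fold cross-validation."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    models = {
        # Hyperparameters are illustrative defaults, not values from the paper.
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }
    return {
        name: cross_val_score(model, embeddings, labels, cv=cv, scoring="accuracy")
        for name, model in models.items()
    }


if __name__ == "__main__":
    X, y = make_placeholder_embeddings()
    for name, scores in evaluate(X, y).items():
        print(f"{name}: mean accuracy {scores.mean():.3f} over {len(scores)} folds")
```

Because the classifiers only see fixed-length vectors, swapping in a different embedding extractor changes the feature matrix but not the evaluation loop, which is what makes a side-by-side comparison of SSL and speaker-recognition embeddings straightforward.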

Paper Structure (3 sections, 1 figure, 1 table)


Figures (1)

  • Figure 1: Audio Violence Detection Application. Figure 1(a) presents the application architecture pipeline, showing the flow of information from the input audio to the final inference received by the user through the user interface; Figure 1(c) shows the confusion matrix of the best model, RF with x-vector embeddings.