An Investigation Into Explainable Audio Hate Speech Detection

Jinmyeong An, Wonjun Lee, Yejin Jeon, Jungseul Ok, Yunsu Kim, Gary Geunbae Lee

TL;DR

This work introduces explainable audio hate speech detection, a new task within the audio hate speech detection domain, and proposes two approaches to it: cascading and End-to-End (E2E). In experiments, the E2E approach outperforms the cascading method on the audio frame Intersection over Union (IoU) metric.

Abstract

Research on hate speech has predominantly revolved around detection and interpretation from textual inputs, leaving verbal content largely unexplored. While there has been limited exploration into hate speech detection within verbal acoustic speech inputs, the aspect of interpretability has been overlooked. Therefore, we introduce a new task of explainable audio hate speech detection. Specifically, we aim to identify the precise time intervals, referred to as audio frame-level rationales, which serve as evidence for hate speech classification. Towards this end, we propose two different approaches: cascading and End-to-End (E2E). The cascading approach initially converts audio to transcripts, identifies hate speech within these transcripts, and subsequently locates the corresponding audio time frames. Conversely, the E2E approach processes audio utterances directly, which allows it to pinpoint hate speech within specific time frames. Additionally, due to the lack of explainable audio hate speech datasets that include audio frame-level rationales, we curated a synthetic audio dataset to train our models. We further validated these models on actual human speech utterances and found that the E2E approach outperforms the cascading method in terms of the audio frame Intersection over Union (IoU) metric. Furthermore, we observed that including frame-level rationales significantly enhances hate speech detection accuracy for the E2E approach.

Disclaimer: The reader may encounter content of an offensive or hateful nature. However, given the nature of the work, this cannot be avoided.
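To make the audio frame IoU metric concrete, here is a minimal sketch of how it could be computed. It assumes that the predicted and ground-truth rationales are represented as equal-length binary masks with one entry per audio frame. The function name `frame_iou`, the mask representation, and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not details taken from the paper.

```python
def frame_iou(pred_mask, gold_mask):
    """Intersection over Union between two binary frame masks.

    Assumption: each mask holds one 0/1 entry per audio frame and both
    masks cover the same utterance, so they have the same length.
    """
    if len(pred_mask) != len(gold_mask):
        raise ValueError("masks must have the same number of frames")

    # Frames marked as rationale in both masks.
    intersection = sum(1 for p, g in zip(pred_mask, gold_mask) if p and g)
    # Frames marked as rationale in at least one mask.
    union = sum(1 for p, g in zip(pred_mask, gold_mask) if p or g)

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0
    return intersection / union


# Hypothetical 6-frame utterance: the prediction is shifted one frame early.
gold = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
print(frame_iou(pred, gold))  # 2 shared frames / 4 frames in the union
```

Under this definition, a prediction that overlaps the true interval but starts or ends at the wrong time is penalized in proportion to the number of mismatched frames.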

Paper Structure

This paper contains 18 sections, 4 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Overview of the cascaded method. Boxes highlighted in yellow indicate model outputs.
  • Figure 2: Overview of the E2E model. Boxes highlighted in yellow indicate model outputs (AHS-CLS and AHS-FD).
  • Figure 3: Comparison of IoU scores on human recording test data within three different WER ranges.
  • Figure 4: Impact of ASR error on the IoU score in the cascaded method. The numbers in parentheses represent the total number of parameters in different ASR (Whisper) models.
  • Figure 5: Visualization of audio hate speech frame prediction for E2E and cascading models. Blue letters and graphs indicate the ground truth transcript and rationale, while red letters and graphs show the values predicted by the model. The green part represents the range of the time frame that the model actually predicts.
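The last step of the cascading approach described in the abstract, locating the audio time frames that correspond to flagged transcript words, might look like the sketch below. It assumes that an upstream ASR system supplies word-level start and end times in milliseconds and that a text classifier has already marked each word as rationale or not. The function name `words_to_frame_mask`, the tuple layout, and the 20 ms frame length are hypothetical choices for illustration, not the paper's settings.

```python
def words_to_frame_mask(words, duration_ms, frame_ms=20):
    """Project word-level rationale flags onto a binary frame mask.

    Assumptions (not from the paper):
      - `words` is a list of (start_ms, end_ms, is_rationale) tuples,
        e.g. produced by an ASR system with word timestamps plus a
        text-level hate speech rationale tagger;
      - frames are non-overlapping and `frame_ms` milliseconds long.
    """
    # Ceiling division, so a trailing partial frame is still counted.
    n_frames = -(-duration_ms // frame_ms)
    mask = [0] * n_frames

    for start_ms, end_ms, is_rationale in words:
        if not is_rationale:
            continue
        first = start_ms // frame_ms                  # frame containing the word start
        last = min(n_frames, -(-end_ms // frame_ms))  # one past the frame containing the word end
        for i in range(first, last):
            mask[i] = 1
    return mask


# Hypothetical 200 ms utterance with three words; only the middle one is flagged.
words = [(0, 60, False), (60, 130, True), (130, 200, False)]
print(words_to_frame_mask(words, duration_ms=200))
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
```

Because the mask depends entirely on the ASR timestamps and transcript, recognition errors propagate directly into the predicted frames, which is consistent with the ASR-error analysis summarized in the Figure 4 caption.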