Attentive Fusion: A Transformer-based Approach to Multimodal Hate Speech Detection

Atanu Mandal, Gargi Roy, Amit Barman, Indranil Dutta, Sudip Kumar Naskar

TL;DR

This work tackles hate speech detection by fusing audio and text signals in a Transformer-based architecture. It introduces an Attentive Fusion layer to jointly integrate two processing pipelines, enabling cross-modal learning and attention-driven combination. The model achieves a macro F1 of $0.927$ on the test set, outperforming audio-only baselines and several baselines in prior work. The approach demonstrates the value of multimodal signals for capturing nuanced cues like sarcasm and tone, with potential applicability beyond English, pending multilingual evaluation.

Abstract

With the rapid growth of social media usage, scrutinizing social media content for hateful material is of utmost importance. For the past decade, researchers have worked on distinguishing content that promotes hatred from content that does not. Traditionally, the main focus has been on analyzing textual content, although recent research has also begun to address audio-based content. However, studies have shown that relying solely on audio or solely on text may be ineffective, because individuals often employ sarcasm in their speech and writing. To overcome these challenges, we present an approach that uses both audio and textual representations to identify whether a speech promotes hate. Our methodology is based on the Transformer framework, incorporates both audio and text sampling, and adds a layer of our own called "Attentive Fusion". Our results surpass previous state-of-the-art techniques, achieving a macro F1 score of 0.927 on the test set.
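
The headline metric is macro F1, which is the unweighted mean of the per-class F1 scores, so the "Hate" and "Not Hate" classes count equally regardless of how many samples each has. The following is a minimal sketch of that computation in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper's code or data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, hand-made labels (hypothetical, for illustration only)
y_true = ["hate", "hate", "not_hate", "not_hate", "not_hate", "not_hate"]
y_pred = ["hate", "not_hate", "not_hate", "not_hate", "not_hate", "hate"]

score = macro_f1(y_true, y_pred, labels=["hate", "not_hate"])
print(score)  # F1(hate) = 0.5, F1(not_hate) = 0.75, so macro F1 = 0.625
```

Because each class contributes equally to the average, a model that ignores the minority class is penalized even when its overall accuracy looks high.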

Paper Structure (26 sections, 4 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Identification of "Hate" or "Not Hate" using multimodality approach
  • Figure 2: Pictorial representation of the contribution of datasets
  • Figure 3: Sample count for "Hate" and "Not Hate"
  • Figure 4: Sample count for "Hate" and "Not Hate"
  • Figure 5: Scatter representation of Datasets according to audio length
  • ...and 3 more figures