IITR-CIOL@NLU of Devanagari Script Languages 2025: Multilingual Hate Speech Detection and Target Identification in Devanagari-Scripted Languages

Siddhant Gupta, Siddh Singhal, Azmine Toushik Wasi

TL;DR

The paper addresses hate-speech detection and target identification in Devanagari-script languages by introducing MultilingualRobertaClass, a classifier built on the pretrained ia-multilingual-transliterated-roberta model to handle multilingual and transliterated text. It reports strong Subtask B (detection) performance, with accuracy of about 0.884 and balanced precision and recall, and weaker Subtask C (target identification) performance, with accuracy of about 0.661, which suggests that detecting hate speech is more tractable than identifying its target. The model combines contextual multilingual embeddings with a compact classifier head, and ablations show that sequence length substantially affects results. The work is aimed at content moderation for South Asian multilingual text and outlines future directions for improving target identification in sociolinguistically diverse contexts.

Abstract

This work focuses on two subtasks related to hate speech detection and target identification in Devanagari-scripted languages, specifically Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit. Subtask B involves detecting hate speech in online text, while Subtask C requires identifying the specific targets of hate speech, such as individuals, organizations, or communities. We propose the MultilingualRobertaClass model, a deep neural network built on the pretrained multilingual transformer model ia-multilingual-transliterated-roberta, optimized for classification tasks in multilingual and transliterated contexts. The model leverages contextualized embeddings to handle linguistic diversity, with a classifier head for binary classification. On the test set, we achieved 88.40% accuracy in Subtask B and 66.11% accuracy in Subtask C.
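
To make the "contextualized embeddings plus classifier head" idea concrete, the following is a minimal, dependency-free sketch of a classification head. It is not the authors' implementation: the pooling choice (first-token embedding), the single linear layer, the hidden size, the random weights, and all function and label names are assumptions made for illustration. In the actual model the token embeddings would come from the pretrained encoder, and the head's weights would be learned during fine-tuning.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one logit per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(token_embeddings, weights, bias):
    """Pool a sequence by taking its first token's embedding (an assumption),
    then apply a linear layer and softmax. Returns (label_index, probabilities)."""
    pooled = token_embeddings[0]
    probs = softmax(linear(pooled, weights, bias))
    return probs.index(max(probs)), probs


def random_head(num_classes, hidden, seed=0):
    """Hypothetical untrained head; real weights would be learned."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]
               for _ in range(num_classes)]
    bias = [0.0] * num_classes
    return weights, bias


# Stand-in for encoder output: 5 tokens, hidden size 8 (both hypothetical).
HIDDEN = 8
_rng = random.Random(42)
embeddings = [[_rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)] for _ in range(5)]

# Subtask B style head: two classes (label names are illustrative).
labels_b = ["non-hate", "hate"]
w_b, b_b = random_head(len(labels_b), HIDDEN)
idx_b, probs_b = classify(embeddings, w_b, b_b)

# Subtask C style head: three target classes named in the abstract.
labels_c = ["individual", "organization", "community"]
w_c, b_c = random_head(len(labels_c), HIDDEN, seed=1)
idx_c, probs_c = classify(embeddings, w_c, b_c)

print(labels_b[idx_b], probs_b)
print(labels_c[idx_c], probs_c)
```

The only difference between the two heads in this sketch is the number of output rows in the weight matrix, so the same encoder output can feed either subtask.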

Paper Structure

This paper contains 11 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Model architecture, containing tokenizer, pre-trained model, classifier and other components