MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

Inayat Arshad, Fajar Saleem, Ijaz Hussain

TL;DR

MUTEX achieves a 60% token-level F1 score, establishing the first supervised baseline for Urdu toxic span detection. The results also indicate that transformer-based models capture contextual toxicity implicitly and more effectively than other models, and that they cope better with code-switching and morphological variation.

Abstract

Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within a text. The problem is exacerbated by several factors: the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variation. In this research, we propose MUTEX, a framework for Urdu toxic span detection that combines a multilingual transformer with conditional random fields (CRF) and uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews, using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a 60% token-level F1 score, which is the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models capture contextual toxicity implicitly and more effectively than other models, and that they cope better with code-switching and morphological variation.
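To make the evaluation metric concrete, the following is a minimal sketch of a token-level F1 computation over BIO-style tags. It is an illustration under stated assumptions, not the authors' evaluation script: the abstract does not specify the averaging scheme, so the sketch assumes micro-averaging over all tokens and treats any tag other than "O" as toxic. The tag names `B-TOX` and `I-TOX` and the function name `token_f1` are hypothetical.

```python
from typing import List, Tuple


def token_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), micro-averaged over toxic tokens.

    Assumption: a token counts as toxic when its tag is anything other than "O".
    """
    tp = fp = fn = 0
    for g_seq, p_seq in zip(gold, pred):
        if len(g_seq) != len(p_seq):
            raise ValueError("gold and predicted sequences must be aligned")
        for g, p in zip(g_seq, p_seq):
            g_toxic, p_toxic = g != "O", p != "O"
            if g_toxic and p_toxic:
                tp += 1          # toxic token correctly flagged
            elif p_toxic:
                fp += 1          # clean token flagged as toxic
            elif g_toxic:
                fn += 1          # toxic token missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical four-token example: one hit, one miss, one false alarm.
gold = [["O", "B-TOX", "I-TOX", "O"]]
pred = [["O", "B-TOX", "O", "B-TOX"]]
precision, recall, f1 = token_f1(gold, pred)
```

In this example precision, recall, and F1 are all 0.5: one of the two predicted toxic tokens is correct, and one of the two gold toxic tokens is found.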

Paper Structure (66 sections, 11 equations, 7 figures, 13 tables, 1 algorithm)

Figures (7)

  • Figure 1: Data collection and annotation pipeline for the Urdu toxic span detection dataset.
  • Figure 2: Distribution of toxicity categories in the annotated Urdu dataset.
  • Figure 3: Proposed system architecture for Urdu toxic span detection.
  • Figure 4: System output interface for Urdu toxic span detection, showing highlighted toxic spans with color-coded severity levels.
  • Figure 5: Performance comparison of baseline and proposed models on Urdu toxic span detection. XLM-RoBERTa+CRF achieves the highest token-level F1-score of 60.0%, demonstrating the effectiveness of combining multilingual contextual embeddings with structured sequence prediction. The CRF layer improves F1 by 1.0 percentage point over XLM-RoBERTa alone by enforcing valid BIO tag sequences.
  • ...and 2 more figures
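
The Figure 5 caption credits the CRF layer with enforcing valid BIO tag sequences. The sketch below illustrates that idea with a simplified Viterbi decoder. It is not the paper's implementation: a trained CRF learns transition scores, whereas this sketch assumes only hard constraints (an "I" tag may not start a sequence or follow "O") on top of per-token emission scores. The three-tag set and the names `viterbi_bio` and `allowed` are hypothetical.

```python
from typing import List

TAGS = ["O", "B", "I"]
NEG_INF = float("-inf")


def allowed(prev: str, cur: str) -> bool:
    # In BIO, "I" may only continue a span opened by "B" or extended by "I".
    return not (cur == "I" and prev == "O")


def viterbi_bio(emissions: List[List[float]]) -> List[str]:
    """Best valid BIO path, where emissions[t][k] scores TAGS[k] at token t."""
    if not emissions:
        return []
    n = len(TAGS)
    # A sequence cannot begin with "I".
    score = [emissions[0][k] if TAGS[k] != "I" else NEG_INF for k in range(n)]
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n):
            # Best previous tag among those permitted before TAGS[k].
            candidates = [j for j in range(n) if allowed(TAGS[j], TAGS[k])]
            best_j = max(candidates, key=lambda j: score[j])
            new_score.append(score[best_j] + emissions[t][k])
            pointers.append(best_j)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    k = max(range(n), key=lambda j: score[j])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [TAGS[k] for k in path]


# Hypothetical scores: per-token argmax would give the invalid path I, I, O.
emissions = [[0.1, 0.0, 2.0], [0.0, 0.1, 3.0], [1.0, 0.0, 0.0]]
decoded = viterbi_bio(emissions)  # ["B", "I", "O"]
```

Choosing the highest-scoring tag at each token independently would open this span with "I", which is not a valid BIO sequence. The constrained decoder instead returns the best-scoring valid path, which starts the span with "B".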