cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate Speech in Under-resourced Languages

Sidney G. -J. Wong, Matthew Durward

TL;DR

The paper tackles automatic detection of anti-LGBTQ+ hate speech across ten language conditions using transformer-based models (XLM-RoBERTa) with domain adaptation that incorporates synthetic and organic script-switching to reflect how language is used on social media. It compares baseline, synthetic, and organic retrained models in monolingual and multilingual setups, and reports Macro F1 scores ranging from ~0.97 for Telugu to ~0.32 for English. The findings suggest that treating script-switching as a paralinguistic cue can improve performance for several languages, though gains are uneven because of differences in data size and class imbalance. The work underscores the potential of script-switching-informed domain adaptation for multilingual hate-speech detection while highlighting ethical considerations and the need for community-driven validation. Overall, the study offers practical guidance for deploying multilingual hate-speech systems in under-resourced languages and points to future directions involving cultural and linguistic context awareness.
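Because the summary reports Macro F1 and attributes uneven gains partly to class imbalance, a minimal sketch of the metric may help. It is an illustration only, not the authors' evaluation code; the label names (`none`, `homophobia`, `transphobia`) and the toy predictions are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily imbalanced toy data: the model always predicts "none".
y_true = ["none"] * 8 + ["homophobia", "transphobia"]
y_pred = ["none"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

In this toy case accuracy is 0.80 while Macro F1 is roughly 0.30, because each minority class contributes an F1 of zero with the same weight as the majority class.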

Abstract

This paper describes our system for detecting homophobia and transphobia in social media comments, developed as part of the shared task at LT-EDI-2024. We took a transformer-based approach to develop our multiclass classification model for ten language conditions (English, Spanish, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Tulu, and Telugu). We introduced synthetic and organic instances of script-switched language data during domain adaptation to mirror the linguistic realities of social media language as seen in the labelled training data. Our system ranked second for Gujarati and Telugu, with varying levels of performance for the other language conditions. The results suggest that incorporating elements of paralinguistic behaviour such as script-switching may improve the performance of language detection systems, especially for under-resourced language conditions.

Paper Structure (11 sections, 2 figures, 3 tables)

This paper contains 11 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Barplot of labelled training data. The combined total number of observations (in thousands) by language condition, ordered from the most (kan, Kannada) to the least (tcy, Tulu) observations.
  • Figure 2: Boxplot of labelled training data. The proportion of observations with at least one word written in Latin script by language condition, ordered from the lowest (tcy, Tulu) to the highest (guj, Gujarati) proportion.
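
The measure described for Figure 2 (the share of observations containing at least one Latin-script word) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' procedure: tokens are split on whitespace, a token counts as Latin-script when all of its letters have "LATIN" in their Unicode character name, and the sample comments and function names are hypothetical.

```python
import unicodedata


def is_latin_word(token):
    """True if the token has at least one letter and all its letters are Latin."""
    letters = [ch for ch in token if ch.isalpha()]
    return bool(letters) and all(
        "LATIN" in unicodedata.name(ch, "") for ch in letters
    )


def has_latin_word(text):
    """True if any whitespace-separated token is written in Latin script."""
    return any(is_latin_word(token) for token in text.split())


def latin_proportion(comments):
    """Share of comments that contain at least one Latin-script word."""
    if not comments:
        return 0.0
    return sum(has_latin_word(c) for c in comments) / len(comments)


# Hypothetical comments: mixed Tamil/Latin, Tamil only, Latin only.
sample = ["இது super video", "நல்ல பதிவு", "hello"]
print(f"{latin_proportion(sample):.2f} of sample comments contain a Latin-script word")
```

Relying on Unicode character names keeps accented Latin letters (as in Spanish) in scope without maintaining explicit code-point ranges; a production pipeline might prefer a dedicated script-detection library.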