cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate Speech in Under-resourced Languages
Sidney G.-J. Wong, Matthew Durward
TL;DR
The paper tackles automatic detection of anti-LGBTQ+ hate speech across ten under-resourced language conditions using transformer-based models (XLM-RoBERTa) with domain adaptation that incorporates synthetic and organic script-switching to reflect the realities of social media language. It compares baseline, synthetic, and organic retrained models in both monolingual and multilingual setups, and reports Macro F1 scores such as roughly 0.97 for Telugu and roughly 0.32 for English. The findings suggest that treating script-switching as a paralinguistic cue can improve performance for several languages, though gains are uneven because of differences in data size and class imbalance. The work underscores the potential of script-switching-informed domain adaptation for multilingual hate-speech detection while highlighting ethical considerations and the need for community-driven validation. Overall, the study provides practical guidance for deploying multilingual hate-speech systems in under-resourced languages and points to future directions involving cultural and linguistic context awareness.
Abstract
This paper describes our system for detecting homophobia and transphobia in social media comments, developed as part of the shared task at LT-EDI-2024. We took a transformer-based approach to develop our multiclass classification model for ten language conditions (English, Spanish, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Tulu, and Telugu). We introduced synthetic and organic instances of script-switched language data during domain adaptation to mirror the linguistic realities of social media language as seen in the labelled training data. Our system ranked second for Gujarati and Telugu, with varying levels of performance for the other language conditions. The results suggest that incorporating elements of paralinguistic behaviour such as script-switching may improve the performance of language detection systems, especially for under-resourced language conditions.
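
The Macro F1 figures quoted in the TL;DR are sensitive to class imbalance, because Macro F1 averages the per-class F1 scores without weighting them by class frequency. The following is a minimal sketch of that metric in plain Python, not the shared task's evaluation script. The label names and the toy predictions are hypothetical and chosen only to show how a classifier that always predicts the majority class can score well on accuracy but poorly on Macro F1.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced toy data: 8 "none" comments and 2 "homophobia" comments.
y_true = ["none"] * 8 + ["homophobia"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["none"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

In this toy case the minority class contributes an F1 of zero, which pulls the average down even though most comments are classified correctly. This is one reason why scores can vary widely across language conditions with different label distributions.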
