Dataset Creation and Baseline Models for Sexism Detection in Hausa

Fatima Adam Muhammad, Shamsuddeen Muhammad Hassan, Isa Inuwa-Dutse

TL;DR

This paper addresses sexism detection in Hausa, a low-resource language, by creating the first Hausa sexism dataset through community engagement, qualitative coding, and data augmentation. It conducts a two-stage user study with $n=66$ native speakers (pilot $n=33$, main $n=33$) to ground culturally valid definitions and expressions, and evaluates both traditional classifiers and pre-trained multilingual models, including few-shot prompting. The contributions include the Hausa sexism dataset, a data-augmentation pipeline that leverages English resources, and a comprehensive baseline evaluation showing that few-shot learning with LLMs can approach or exceed traditional baselines, while cultural nuances remain a source of false positives. The work advances resources for low-resource language NLP and demonstrates a community-driven approach to building context-aware moderation tools.

Abstract

Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Because online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages, where scarce linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. To capture cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models, and evaluate the effectiveness of few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency toward false positives in such cases.

Paper Structure

This paper contains 15 sections, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Overview of our approach, depicting the data sources (user study and data augmentation) and baseline model development.