Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa

Babangida Sani, Aakansha Soy, Sukairaj Hafiz Imam, Ahmad Mustapha, Lukman Jibril Aliyu, Idris Abdulmumin, Ibrahim Said Ahmad, Shamsuddeen Hassan Muhammad

TL;DR

This work tackles the problem of distinguishing human- from machine-generated text in Hausa, a low-resource language. It builds a labeled Hausa dataset by collecting human-written articles from seven news outlets and generating the machine-written counterparts with Gemini-2.0 Flash. Four Afri-centric transformer models are fine-tuned on this data; AfroXLMR performs best, with $ACC=0.9923$ and $F1=0.9921$, and the other models also perform strongly. The study contributes the first large-scale Hausa detector and an openly available dataset, demonstrating that multilingual models optimized for African languages can effectively detect machine-generated text in Hausa. The results have practical implications for content authenticity in Hausa-language media and provide a foundation for future cross-domain and cross-language detection work in low-resource settings.

Abstract

The advancement of large language models (LLMs) has made them proficient in various tasks, including content generation. However, their unregulated use can enable malicious activities such as plagiarism and the generation and spread of fake news, especially in low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages such as English and French. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scraped seven Hausa-language media outlets for the human-generated text and used the Gemini-2.0 Flash model to automatically generate the corresponding Hausa-language articles from the human-written article headlines. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.
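To make the two reported metrics concrete, here is a minimal sketch of how accuracy and F1 can be computed for a binary human-vs-machine classifier. It is not the authors' evaluation code: the label encoding (1 = machine-generated, 0 = human-written), the function name `accuracy_and_f1`, and the toy predictions are all assumptions made for illustration, and the abstract does not say which F1 averaging scheme was used, so plain binary F1 on the machine-generated class is assumed.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, binary F1) for the given positive class.

    Assumed label encoding (hypothetical, not from the paper):
    1 = machine-generated, 0 = human-written.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, f1


# Toy labels for illustration only; these are not results from the paper.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.4f}, f1={toy_f1:.4f}")
```

Accuracy counts every correct prediction equally, whereas F1 combines precision and recall on the positive class, so the two can diverge when the classes are imbalanced. Reporting both, as the abstract does, guards against a detector that scores well simply by favoring one class.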

Paper Structure

This paper contains 14 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the pipeline's data collection for human-generated texts.