Are LLMs Ready to Replace Bangla Annotators?
Md. Najib Hasan, Touseef Hasan, Souvika Sarkar
TL;DR
This work critically examines whether large language models can reliably replace human annotators for Bangla hate-speech data. It introduces BAHS, a 5,000-sample Bangla dataset with 21 identity-sensitive categories, and a unified three-dimension evaluation framework (classification, reasoning, prompt sensitivity) to quantify annotator bias and stability across 17 LLMs. The findings reveal that model scale does not consistently improve annotation quality; many models exhibit identity bias and high sensitivity to prompt wording, while reasoning fluency does not guarantee correct labeling. The study highlights the need for rigorous evaluation, domain-specific prompting, and cautious deployment of LLM-based annotators in low-resource, identity-sensitive contexts, with implications for bias amplification and fairness in downstream tasks.
Abstract
Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially in low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human annotators struggle to agree and where annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
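The instability in model judgments mentioned above can be made concrete with a small sketch. It is an illustration only and not necessarily the metric used in this work: it assumes each model has labeled the same samples under several paraphrased prompts, and it reports how often the label changes across prompts. The function name `prompt_sensitivity`, the prompt keys, and the label strings are all hypothetical.

```python
from itertools import combinations


def prompt_sensitivity(labels_by_prompt):
    """Summarize how much one model's labels change across prompt variants.

    labels_by_prompt: dict mapping a (hypothetical) prompt name to the list of
    labels the model produced, with samples in the same order for every prompt.
    Returns (flip_rate, mean_pairwise_agreement).
    """
    runs = list(labels_by_prompt.values())
    if len(runs) < 2:
        raise ValueError("need at least two prompt variants")
    n = len(runs[0])
    if n == 0 or any(len(run) != n for run in runs):
        raise ValueError("every prompt must label the same non-empty sample set")

    # Flip rate: share of samples whose label is not identical under all prompts.
    flip_rate = sum(len(set(sample)) > 1 for sample in zip(*runs)) / n

    # Mean pairwise agreement: average fraction of matching labels per prompt pair.
    pairs = list(combinations(runs, 2))
    agreement = sum(
        sum(a == b for a, b in zip(x, y)) / n for x, y in pairs
    ) / len(pairs)
    return flip_rate, agreement


# Toy labels for four samples under three assumed prompt paraphrases.
toy = {
    "prompt_a": ["hate", "not", "hate", "not"],
    "prompt_b": ["hate", "hate", "hate", "not"],
    "prompt_c": ["hate", "not", "not", "not"],
}
flip_rate, agreement = prompt_sensitivity(toy)
print(f"flip rate: {flip_rate:.2f}, mean pairwise agreement: {agreement:.2f}")
```

A high flip rate alongside fluent explanations would be one sign of the instability the abstract describes. A chance-corrected statistic such as Cohen's kappa could replace raw agreement where label distributions are skewed.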
