Are LLMs Ready to Replace Bangla Annotators?

Md. Najib Hasan, Touseef Hasan, Souvika Sarkar

TL;DR

This work critically examines whether large language models can reliably replace human annotators for Bangla hate-speech data. It introduces BAHS, a 5,000-sample Bangla dataset with 21 identity-sensitive categories, and a unified three-dimension evaluation framework (classification, reasoning, prompt sensitivity) to quantify annotator bias and stability across 17 LLMs. The findings reveal that model scale does not consistently improve annotation quality; many models exhibit identity bias and high sensitivity to prompt wording, while reasoning fluency does not guarantee correct labeling. The study highlights the need for rigorous evaluation, domain-specific prompting, and cautious deployment of LLM-based annotators in low-resource, identity-sensitive contexts, with implications for bias amplification and fairness in downstream tasks.

Abstract

Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.

Paper Structure

This paper contains 28 sections, 1 equation, 12 figures, 16 tables, 1 algorithm.

Figures (12)

  • Figure 1: Identity-sensitive bias evaluation. We send the same text, labels, and prompt level, and change only the identity assigned to the annotator (LLM as Male vs. Female). A human evaluator checks whether the predicted labels differ across identities.
  • Figure 2: Overview of the Three-Dimension Evaluation Framework for Annotator Bias in LLMs on Bengali Hate Speech. The diagram illustrates the end-to-end pipeline of evaluating LLMs on identity-sensitive hate speech annotation in Bangla. It includes data collection, preprocessing, and final evaluation across three dimensions: E1 (Classification of hateful comments), E2 (Reasoning Alignment), and E3 (Prompt Sensitivity based on TELeR taxonomy). The framework incorporates diverse identity groups (e.g., religious, professional, geopolitical) and categorizes models into Human-like, Objective, or Adaptive annotators based on performance.
  • Figure 3: Benchmarking and performance evaluation across 17 LLMs on the Bangla_HateSpeech dataset. Subfigure (a) shows comment classification performance (E1) for 11 positive categories, measured by F1 score; subfigure (b) shows reasoning alignment performance (E2), measured by BERTScore.
  • Figure 4: F1 score-based prompt-level bias in the Gender category (Female vs. Not Female) across 17 LLMs. Each boxplot shows performance variation across four prompt levels. Wider spreads indicate greater sensitivity to prompt wording and instability in annotation.
  • Figure 5: Performance evaluation across 17 LLMs on the classification task for positive categories, measured by Cohen's Kappa.
  • ...and 7 more figures
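
The quantities named in these captions (F1 score, Cohen's Kappa, label changes across annotator identities, and F1 variation across prompt levels) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the binary 0/1 label encoding, the "flip rate" summary, and the range and standard deviation used as spread measures are all assumptions made for illustration. Only the standard textbook definitions of F1 and Cohen's Kappa are used.

```python
from collections import Counter
from statistics import pstdev


def f1_binary(gold, pred, positive=1):
    """F1 score for the positive class (standard definition)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero/undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(a, b):
    """Cohen's Kappa: chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from each sequence's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences consist of one identical label
    return (observed - expected) / (1 - expected)


def flip_rate(pred_a, pred_b):
    """Hypothetical summary: share of items whose label changes between
    two annotator identities (e.g. LLM as Male vs. LLM as Female)."""
    return sum(1 for x, y in zip(pred_a, pred_b) if x != y) / len(pred_a)


def prompt_spread(gold, preds_by_level):
    """Hypothetical summary: range and population std. dev. of F1 across
    prompt levels; a larger spread suggests higher prompt sensitivity."""
    scores = [f1_binary(gold, preds) for preds in preds_by_level.values()]
    return max(scores) - min(scores), pstdev(scores)


if __name__ == "__main__":
    # Toy labels invented for this example; they are not from the paper.
    gold = [1, 1, 0, 0, 1, 0, 1, 0]
    as_male = [1, 1, 0, 0, 1, 0, 0, 0]
    as_female = [1, 0, 0, 1, 1, 0, 0, 0]
    levels = {"L1": as_female, "L2": as_male, "L3": gold, "L4": as_male}
    print("kappa vs. gold:", cohen_kappa(gold, as_male))
    print("identity flip rate:", flip_rate(as_male, as_female))
    print("F1 range, std across levels:", prompt_spread(gold, levels))
```

The range roughly corresponds to the whisker-to-whisker width of a boxplot such as the one described for Figure 4, while the standard deviation is less sensitive to a single outlying prompt level; which summary the paper itself reports is not stated in this excerpt.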