Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties

Fahim Faisal, Md Mushfiqur Rahman, Antonios Anastasopoulos

Abstract

There has been little systematic study of how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal nuances is still underexplored and requires more focused attention. In this paper, we address these gaps through a comprehensive toxicity evaluation of LLMs across diverse dialects. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluate three LLMs on their ability to assess toxicity along three axes: multilingual, dialectal, and LLM-human consistency. Our findings show that LLMs are sensitive to both multilingual and dialectal variation. Ranking the three axes, LLM-human agreement is the weakest, followed by dialectal consistency. Code repository: https://github.com/ffaisal93/dialect_toxicity_llm_judge

Paper Structure

This paper contains 21 sections, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The evaluation of LLMs uses three consistency metrics (Multilingual, Dialectal, and LLM-Human) to assess how consistent model responses are across languages and across dialects, and how well they align with human judgments.
  • Figure 2: Overview of the dialectal dataset expansion: The figure shows the process of creating a multilingual, multi-dialect toxicity dataset through machine translation and dialect synthesis, enriched with real-world speaker utterances.
  • Figure 3: Rubric and instruction prompt for LLM-based toxicity evaluation across dialects.
  • Figure 4: We compute F1 scores for each language cluster by averaging over all of its dialects. The original ToxiGen intent labels, which range continuously from 1 to 5, are converted into 3 and 5 bins for evaluation. The results indicate overall low agreement with human annotations (F1 below 50%), with NeMo scoring highest. Performance tends to decrease for low-resource language varieties.
  • Figure 5: This bar plot shows how different language models (Phi-3, Aya-23, and NeMo) perform in terms of valid formatted output across multiple language clusters when given multilingual instructions. While all models achieve high performance for English, their performance varies significantly for other languages. Aya-23 generally performs better, whereas Phi-3 struggles more across most languages.
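
The Figure 4 caption describes binning continuous 1-to-5 labels and averaging F1 over the dialects of each language cluster. A minimal sketch of one way such a computation could look is given below. It is not the authors' implementation: the equal-width binning, the use of macro-averaged F1, the assumption that the LLM also emits a score on the 1-to-5 scale, and all function, cluster, and dialect names are assumptions made for illustration.

```python
from collections import defaultdict


def to_bin(score, n_bins, lo=1.0, hi=5.0):
    """Map a continuous score in [lo, hi] to a bin index 0..n_bins-1.

    Equal-width bins are an assumption; the source does not state the scheme.
    """
    if not lo <= score <= hi:
        raise ValueError(f"score {score} outside [{lo}, {hi}]")
    width = (hi - lo) / n_bins
    # Clamp so that score == hi lands in the last bin.
    return min(int((score - lo) / width), n_bins - 1)


def macro_f1(gold, pred, n_bins):
    """Unweighted mean of per-class F1, skipping classes absent from both lists."""
    scores = []
    for c in range(n_bins):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        if tp + fp + fn == 0:
            continue
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores) if scores else 0.0


def cluster_f1(records, n_bins):
    """Average per-dialect F1 within each language cluster.

    records: iterable of (cluster, dialect, human_score, llm_score) tuples,
    a hypothetical layout chosen for this sketch.
    """
    by_dialect = defaultdict(lambda: ([], []))
    for cluster, dialect, human, llm in records:
        gold, pred = by_dialect[(cluster, dialect)]
        gold.append(to_bin(human, n_bins))
        pred.append(to_bin(llm, n_bins))

    per_cluster = defaultdict(list)
    for (cluster, _dialect), (gold, pred) in by_dialect.items():
        per_cluster[cluster].append(macro_f1(gold, pred, n_bins))
    return {c: sum(v) / len(v) for c, v in per_cluster.items()}


# Toy records with made-up cluster and dialect names.
records = [
    ("cluster_x", "dialect_a", 1.0, 1.0),
    ("cluster_x", "dialect_a", 5.0, 5.0),
    ("cluster_x", "dialect_b", 1.0, 5.0),
    ("cluster_x", "dialect_b", 5.0, 5.0),
]
result = cluster_f1(records, n_bins=3)
```

Averaging per dialect first, then per cluster, gives every dialect equal weight regardless of how many sentences it contributes; pooling all sentences of a cluster before scoring would be a different (also plausible) reading of the caption.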