DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

Fahim Faisal, Orevaoghene Ahia, Aarohi Srivastava, Kabir Ahuja, David Chiang, Yulia Tsvetkov, Antonios Anastasopoulos

TL;DR

DialectBench introduces the first large-scale NLP benchmark focused on language varieties, aggregating 281 varieties across 40 language clusters and 10 tasks to reveal systematic disparities between standard and non-standard varieties. The work details a comprehensive framework for variety selection, cluster mapping, representative design, task inclusion, and evaluation principles, and provides baseline results using mBERT, XLM-R, NLLB, and Mistral-7B across zero-shot, fine-tuning, and in-context learning settings. It quantifies dialectal gaps with a relative metric across tasks and clusters, highlights the impact of script and data availability, and discusses implications for model robustness and data quality. Overall, the study establishes a foundation for rigorous dialectal NLP benchmarking and points to practical paths for expanding resources, refining metrics, and improving cross-linguistic robustness in real-world settings.

Abstract

Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and one step towards advancing it further. Code/data: https://github.com/ffaisal93/DialectBench

Paper Structure (55 sections, 2 equations, 11 figures, 26 tables)

This paper contains 55 sections, 2 equations, 11 figures, 26 tables.

Figures (11)

  • Figure 1: DialectBench Evaluation Suite.
  • Figure 2: DialectBench language clusters with their variety counts per task. "Other" encompasses 18 clusters (full cluster list in the Appendix).
  • Figure 3: Maximum scores (max. UAS) in the Dep. Parsing task. Yellow-shaded region: Komi is the only cluster with no varieties seen during mBERT pretraining. Colored bars with diagonal stripes: the cluster representative variety. Low-resource cluster varieties score lower than those in high-resource Germanic clusters.
  • Figure 4: Dialectal gap visualization for language clusters using zero-shot cross-lingual transfer from Standard English. On the x-axis, values far from zero indicate a larger performance gap from English; on the y-axis, values far from zero indicate a larger within-cluster gap. Ideally, both should be close to zero.
  • Figure 5: Map of Switzerland with aggregated BLEU scores of Swiss-German variety per region
  • ...and 6 more figures