DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

Fahim Faisal; Orevaoghene Ahia; Aarohi Srivastava; Kabir Ahuja; David Chiang; Yulia Tsvetkov; Antonios Anastasopoulos

TL;DR

DialectBench introduces the first large-scale NLP benchmark focused on language varieties, aggregating 281 varieties across 40 language clusters and 10 tasks to reveal systematic disparities between standard and non-standard varieties. The work lays out a framework covering variety selection, cluster-variety mapping, cluster representatives, task and dataset selection, and evaluation principles, and provides baseline results using mBERT, XLM-R, NLLB, and Mistral-7B across zero-shot, fine-tuning, and in-context learning settings. It quantifies dialectal gaps with a relative metric across tasks and clusters, highlights the impact of script and data availability, and discusses implications for model robustness and data quality. Overall, the study establishes a foundation for dialectal NLP benchmarking and points to practical paths for expanding resources, refining metrics, and improving robustness across varieties.

Abstract

Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and one step towards advancing it further. Code/data: https://github.com/ffaisal93/DialectBench

Paper Structure (55 sections, 2 equations, 11 figures, 26 tables)

Introduction
DialectBench
Variety Selection
Cluster-Variety Mapping
Cluster Representative
Task and Dataset Selection
Evaluation Principles
Experiments
Models
Training and Evaluation
Quantifying the Dialectal Gap
Results
Maximum Obtainable Scores
Structured prediction
Sequence classification
...and 40 more sections

Figures (11)

Figure 1: DialectBench Evaluation Suite.
Figure 2: DialectBench language clusters with their variety counts per task. "Other" encompasses 18 clusters (full cluster list in the Appendix).
Figure 3: Maximum scores (max. UAS) in the dependency parsing task. Yellow-shaded region: Komi is the only cluster with no varieties seen during mBERT pretraining. Colored bars with diagonal stripes: the cluster representative variety. Low-resource cluster varieties score lower than high-resource Germanic clusters.
Figure 4: Dialectal gap visualization for language clusters using zero-shot cross-lingual transfer from Standard English. On the x-axis, values far from zero indicate a larger performance gap from English, whereas on the y-axis, values far from zero indicate a larger within-cluster gap. Ideally, both should be close to zero.
Figure 5: Map of Switzerland with aggregated BLEU scores for Swiss-German varieties per region.
...and 6 more figures
