EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

Abhay Gupta; Jacob Cheung; Philip Meng; Shayan Sayyed; Austen Liao; Kevin Zhu; Sean O'Brien

TL;DR

EnDive addresses the underrepresentation of intra-language variation by introducing a cross-dialect benchmark that translates Standard American English (SAE) datasets into five English dialects across 12 tasks spanning language understanding, algorithmic reasoning, mathematics, and logic. It combines few-shot prompting guided by eWAVE exemplars with BLEU-based filtering and a Multi-VALUE baseline to create challenging, dialect-rich inputs, and validates the result with human evaluation. Across five large language models, EnDive reveals consistent performance gaps on dialectal inputs relative to SAE, highlighting biases in current NLP systems and underscoring the need for dialect-aware evaluation and fairness-focused development. By providing a public, multi-dialect benchmark with robust evaluation (ROUGE and lexical fluency measures, human judgments, and preference tests), EnDive aims to spur progress toward more equitable language technologies that serve speakers of non-standard dialects.

Abstract

The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates five widely used large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities: models consistently underperform on dialectal inputs compared to Standard American English. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
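
The filtering of near-identical translations can be illustrated with a minimal sketch. The TL;DR names BLEU as the similarity measure; everything else here is an assumption made for illustration and is not taken from the paper: the simplified sentence-level BLEU with add-one smoothing, the whitespace tokenization, the hypothetical function names `sentence_bleu` and `filter_near_identical`, and the threshold of 0.7. The strings are toy placeholders, not dialect data.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU (illustrative, not the paper's exact setup).

    Uses whitespace tokenization, clipped n-gram precision for n = 1..max_n
    with add-one smoothing, and the standard brevity penalty.
    """
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the logarithm defined when there is no overlap.
        log_precision_sum += math.log((overlap + 1) / (total + 1))
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum / max_n)


def filter_near_identical(pairs, threshold=0.7):
    """Keep (source, translation) pairs whose BLEU is below the assumed threshold."""
    return [(src, tgt) for src, tgt in pairs if sentence_bleu(src, tgt) < threshold]


# Toy placeholder strings: the first pair is unchanged, the second differs.
pairs = [
    ("the cat sat on the mat today", "the cat sat on the mat today"),
    ("the cat sat on the mat today", "a dog ran across that wide field"),
]
kept = filter_near_identical(pairs)
```

In this sketch the unchanged pair scores a BLEU of 1 and is dropped, while the pair that differs is kept. In practice an established implementation such as sacreBLEU would be a more robust choice than this hand-rolled score.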
