AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

Abhay Gupta; Philip Meng; Ece Yurtseven; Sean O'Brien; Kevin Zhu

TL;DR

This work is motivated by the need for dialectal fairness in NLP and by the limitations of benchmarks that focus on Standard American English (SAE). AAVENUE introduces a translation-based evaluation framework that converts SAE NLU tasks from GLUE and SuperGLUE into African American Vernacular English (AAVE) using few-shot prompting with GPT-4o-mini, with fluent AAVE speakers validating the translations for authenticity. Across five tasks and multiple LLMs, models generally perform better on the original SAE inputs than on their AAVE translations, revealing persistent dialect biases that hinder equitable NLP performance. The work provides open-source code and a public website, and highlights the need for dialect-inclusive training data and model architectures to achieve fairer language technologies for diverse communities.

Abstract

Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models. We have open-sourced our source code on GitHub and created a website to showcase our work at https://aavenuee.github.io.
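The SAE-versus-AAVE comparison described in the abstract can be illustrated with a minimal sketch. It assumes each task item has one gold label and one model prediction for the SAE input and one for its AAVE translation. The function names `accuracy` and `dialect_gap` and the toy labels are hypothetical placeholders for illustration, not the authors' code or results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def dialect_gap(
    sae_predictions: Sequence[str],
    aave_predictions: Sequence[str],
    labels: Sequence[str],
) -> float:
    """SAE accuracy minus AAVE accuracy on the same items.

    A positive value means the model did better on the SAE inputs
    than on the AAVE translations of those inputs.
    """
    return accuracy(sae_predictions, labels) - accuracy(aave_predictions, labels)


# Toy placeholder data (hypothetical, not taken from the paper).
labels = ["yes", "no", "yes", "no"]
sae_predictions = ["yes", "no", "yes", "yes"]
aave_predictions = ["yes", "yes", "no", "yes"]

gap = dialect_gap(sae_predictions, aave_predictions, labels)
print(f"SAE accuracy:  {accuracy(sae_predictions, labels):.2f}")
print(f"AAVE accuracy: {accuracy(aave_predictions, labels):.2f}")
print(f"Gap (SAE - AAVE): {gap:.2f}")
```

Because both prediction lists are scored against the same gold labels, the gap isolates the effect of the dialect of the input rather than differences in the underlying items.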
