Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English

Runtao Zhou; Guangya Wan; Saadia Gabriel; Sheng Li; Alexander J Gates; Maarten Sap; Thomas Hartvigsen

TL;DR

The paper addresses dialectal disparities in LLM reasoning by pairing automated SAE-to-AAE dialect conversion with structured reasoning assessments across multiple benchmarks and models. It reveals that AAE prompts consistently yield lower accuracy, simpler explanations, and reduced consistency, with the largest gaps in social science and humanities domains, and demonstrates that explanation style and reasoning form modulate these biases. Through a linguistically grounded dialect converter and comprehensive evaluation (accuracy, readability, LIWC-based psychological expressions, and consistency), the work also tests mitigation prompts that can reduce the dialect gap. The findings underscore the need for dialect-aware fairness in AI systems, especially in education and healthcare, and provide practical, evidence-based mitigation strategies and a reproducible evaluation framework.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.
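The paired SAE/AAE comparison described above can be illustrated with a minimal sketch. It assumes each question has already been answered once from its SAE prompt and once from its AAE conversion, and that both answers have been graded. The record fields (`domain`, `sae_correct`, `aae_correct`) and the function name are hypothetical, chosen for illustration; this is not the code from the authors' repository.

```python
from collections import defaultdict


def accuracy_gap_by_domain(records):
    """Compute per-domain accuracy for SAE and AAE prompts and their gap.

    `records` is an iterable of dicts with hypothetical keys:
      - "domain": subject area of the question (e.g. "math")
      - "sae_correct": whether the answer to the SAE prompt was correct
      - "aae_correct": whether the answer to the paired AAE prompt was correct
    """
    # For each domain, track [number of pairs, SAE correct, AAE correct].
    totals = defaultdict(lambda: [0, 0, 0])
    for record in records:
        counts = totals[record["domain"]]
        counts[0] += 1
        counts[1] += int(record["sae_correct"])
        counts[2] += int(record["aae_correct"])

    # A positive gap means the SAE prompts were answered more accurately.
    return {
        domain: {
            "sae_acc": sae / n,
            "aae_acc": aae / n,
            "gap": (sae - aae) / n,
        }
        for domain, (n, sae, aae) in totals.items()
    }


# Toy, made-up records used only to show the call; these are not results.
toy_records = [
    {"domain": "math", "sae_correct": True, "aae_correct": True},
    {"domain": "math", "sae_correct": True, "aae_correct": False},
    {"domain": "history", "sae_correct": True, "aae_correct": False},
    {"domain": "history", "sae_correct": False, "aae_correct": False},
]
gaps = accuracy_gap_by_domain(toy_records)
```

Because every AAE prompt is paired with the SAE question it was converted from, the gap is computed over the same set of questions in both conditions, so differences in question difficulty do not confound the comparison.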
