Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English

Runtao Zhou, Guangya Wan, Saadia Gabriel, Sheng Li, Alexander J Gates, Maarten Sap, Thomas Hartvigsen

TL;DR

The paper addresses dialectal disparities in LLM reasoning by pairing automated SAE-to-AAE dialect conversion with structured reasoning assessments across multiple benchmarks and models. It reveals that AAE prompts consistently yield lower accuracy, simpler explanations, and reduced consistency, with the largest gaps in social science and humanities domains, and demonstrates that explanation style and reasoning form modulate these biases. Through a linguistically grounded dialect converter and comprehensive evaluation (accuracy, readability, LIWC-based psychological expressions, and consistency), the work also tests mitigation prompts that can reduce the dialect gap. The findings underscore the need for dialect-aware fairness in AI systems, especially in education and healthcare, and provide practical, evidence-based mitigation strategies and a reproducible evaluation framework.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.

Paper Structure

This paper contains 47 sections, 1 equation, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The experiment simulates a question-and-answer session to evaluate potential language model biases when responding to different English dialects. Specifically, it compares the accuracy and consistency of responses to prompts written in African American English (AAE) versus Standard American English (SAE). The study also analyzes the explanations provided in SAE, as is the case in many applications, examining their consistency, readability, and psychological expression.
  • Figure 2: Linguistic marker differences in explanations for AAE and SAE prompts. Frequencies of linguistic markers are calculated by LIWC and standardized per 1K tokens; statistical significance is marked with ** and * (**: p < 0.01, *: 0.01 <= p < 0.05).
  • Figure 3: Average proportion of annotators favoring each SAE-AAE converter across four metrics (Gupta et al., 2024). Each metric is marked with ** where the difference is statistically significant (p < 0.01).
  • Figure 4: Sample question asking annotators to rank the converted AAE and SAE sentences based on certain metrics.
  • Figure 5: Sample question asking annotators to rate the realism of the converted AAE and SAE sentences on a scale from 0 to 10.
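
The per-1K-token standardization mentioned in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' code: the function name `rate_per_1k`, the whitespace tokenization, and the `CERTAINTY` word set are all hypothetical stand-ins, since LIWC's actual dictionaries are proprietary and are not reproduced here.

```python
def rate_per_1k(tokens, marker_words):
    """Return occurrences of marker words per 1,000 tokens.

    `tokens` is a list of strings; `marker_words` is a set of lowercase words.
    An empty token list yields 0.0 rather than dividing by zero.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in marker_words)
    return 1000.0 * hits / len(tokens)


# Hypothetical marker category (an assumption, not a LIWC dictionary).
CERTAINTY = {"always", "never", "definitely"}

# Naive whitespace tokenization, also an assumption made for illustration.
explanation = "The answer is definitely B because the rate never changes".split()

# 2 marker hits out of 10 tokens, scaled to a per-1,000-token rate.
print(rate_per_1k(explanation, CERTAINTY))
```

Scaling by a fixed token count makes marker frequencies comparable between explanations of different lengths, which matters when one group of explanations tends to be shorter than the other.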