Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models

Mir Tafseer Nayeem, Davood Rafiei

Abstract

Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably "English (US)," despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE--BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that BrE forms incur higher segmentation costs, and (iii) generative evaluations show a persistent AmE preference in model outputs. To our knowledge, this is the first systematic and multi-faceted examination of dialectal asymmetries in standard English varieties across the phases of LLM development. We find that contemporary LLMs privilege AmE as the de facto norm, raising concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, while motivating practical steps toward more dialectally inclusive language technologies.


Paper Structure

This paper contains 66 sections, 9 equations, 11 figures, and 10 tables.

Figures (11)

  • Figure 1: Timeline of independence across countries formerly under British colonization. The map highlights the wave of decolonization in the mid-twentieth century, when nations in Africa, Asia, the Caribbean, and the Pacific gained sovereignty. This geopolitical shift marked the decline of direct colonial governance but reinforced the institutional legacy of British English (BrE) in education, government, journalism, and law across many of these regions.
  • Figure 2: Violin plots showing the distribution of AmE vs. BrE variant probabilities across three pretraining corpora, stratified by linguistic category (orthographic vs. vocabulary). Probabilities are derived from corpus-specific frequencies for 1,813 word pairs, representing mutually exclusive dialectal usage. All distributions show a consistent skew toward AmE variants, especially in spelling patterns. Additional corpora are shown in the Appendix (\ref{fig:violin-others}).
  • Figure 3: Granularity analysis of tokenization lengths for AmE and BrE variants across six tokenizers. Each subplot shows the count of variant pairs split into 1, 2, or 3+ subwords. BrE variants consistently exhibit more 3+ segmentations, indicating less efficient tokenization (other tokenizers in \ref{fig:tokenizer-granularity-app}).
  • Figure 4: Violin plots showing the distribution of AmE vs. BrE variant probabilities across three pretraining corpora: (a) Book Corpus, (b) Falcon RefinedWeb, and (c) RedPajama, stratified by linguistic category (orthographic vs. vocabulary). Probabilities are derived from corpus-specific frequencies for 1,813 word pairs, representing mutually exclusive dialectal usage. All distributions show a consistent skew toward AmE variants, especially in spelling patterns.
  • Figure 5: Average probability of observing AmE or BrE variants across word pairs, grouped by linguistic difference type and visualized for three pretraining corpora: (a) Wikipedia, (b) Common Crawl (C4), and (c) Dolma. Probabilities are computed by normalizing variant frequencies within each pair and averaging across each category, which includes orthographic and vocabulary-based differences. Each cell shows the mean probability for a variant type, with darker shades indicating stronger corpus-level preference. Results consistently reveal a skew toward American English. Additional corpora are presented in the Appendix (\ref{fig:heatmap-app}).
  • ...and 6 more figures
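The captions above describe two measurements: pair-normalized variant probabilities (Figures 2, 4, and 5) and subword-count bins per variant (Figure 3). The sketch below illustrates both computations on a toy corpus; the corpus, word pairs, and fixed-width `toy_tokenize` stand-in are hypothetical stand-ins for the paper's 1,813-pair dataset and real BPE tokenizers, not the authors' implementation.

```python
from collections import Counter

# Toy stand-ins: the paper uses 1,813 AmE-BrE pairs and large pretraining corpora.
corpus = "the color of the theatre curtain matched the color of the truck".split()
pairs = [("color", "colour"), ("truck", "lorry"), ("theater", "theatre")]

freq = Counter(corpus)

def pair_probabilities(pairs, freq):
    """Normalize variant frequencies within each AmE-BrE pair (Figures 2/4/5 style)."""
    probs = {}
    for ame, bre in pairs:
        total = freq[ame] + freq[bre]
        if total:  # skip pairs unattested in the corpus
            probs[(ame, bre)] = (freq[ame] / total, freq[bre] / total)
    return probs

# Fixed-width chunking as a crude stand-in for a BPE tokenizer (hypothetical).
toy_tokenize = lambda w: [w[i:i + 4] for i in range(0, len(w), 4)]

def granularity_bins(variants, tokenize):
    """Bin variants by subword count into 1, 2, or 3+ (Figure 3 style)."""
    bins = {"1": 0, "2": 0, "3+": 0}
    for w in variants:
        n = len(tokenize(w))
        bins["1" if n == 1 else "2" if n == 2 else "3+"] += 1
    return bins

print(pair_probabilities(pairs, freq))
print(granularity_bins(["color", "colour", "aluminium"], toy_tokenize))
```

With a real tokenizer (e.g. a Hugging Face `AutoTokenizer`), `toy_tokenize` would be replaced by the model's own subword segmentation; the binning logic is unchanged.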