When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models

Satyam Kumar Navneet; Joydeep Chandra; Yong Zhang

TL;DR

This work introduces "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing, and identifies a Semantic Preservation Paradox: models maintain high semantic similarity while systematically erasing cultural markers.

Abstract

Large Language Models (LLMs) are increasingly used to "professionalize" workplace communication, often at the cost of linguistic identity. We introduce "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing. Through analysis of 22,350 LLM outputs generated from 1,490 culturally marked texts (Indian, Singaporean, and Nigerian English) processed by five models under three prompt conditions, we quantify this phenomenon using two novel metrics: Identity Erasure Rate (IER) and Semantic Preservation Score (SPS). Across all prompts, we find an overall IER of 10.26%, with model-level variation from 3.5% to 20.5% (a 5.9x range). Crucially, we identify a Semantic Preservation Paradox: models maintain high semantic similarity (mean SPS = 0.748) while systematically erasing cultural markers. Pragmatic markers (politeness conventions) are 1.9x more vulnerable than lexical markers (71.5% vs. 37.1% erasure). Our experiments demonstrate that explicit cultural-preservation prompts reduce erasure by 29% without sacrificing semantic quality.
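To make the idea of an erasure rate concrete, here is a minimal sketch in Python. The abstract does not give the formal definition of IER, so the sketch assumes one plausible reading: the share of annotated markers present in the source text that no longer appear in the model output. The function name `identity_erasure_rate`, the whole-phrase matching rule, and the example markers and sentences are all hypothetical illustrations, not the authors' implementation or data.

```python
import re
from typing import List, Optional


def _contains(marker: str, text: str) -> bool:
    """Case-insensitive whole-phrase match (an assumed matching rule)."""
    pattern = r"(?<!\w)" + re.escape(marker) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def identity_erasure_rate(source: str, output: str,
                          markers: List[str]) -> Optional[float]:
    """Assumed definition: erased markers / markers found in the source.

    Returns None when the source contains none of the markers,
    because the rate is undefined in that case.
    """
    in_source = [m for m in markers if _contains(m, source)]
    if not in_source:
        return None
    erased = [m for m in in_source if not _contains(m, output)]
    return len(erased) / len(in_source)


# Hypothetical markers and texts, for illustration only.
markers = ["prepone", "do the needful", "kindly"]
source = "Kindly do the needful and prepone the meeting."
rewritten = "Please take the necessary action and move the meeting earlier."
partial = "Kindly reschedule the meeting to an earlier time."

full_erasure = identity_erasure_rate(source, rewritten, markers)
partial_erasure = identity_erasure_rate(source, partial, markers)
print(full_erasure, partial_erasure)
```

Under this assumed definition, a rewrite can stay close in meaning to the source while still scoring a high erasure rate, which is the pattern the abstract calls the Semantic Preservation Paradox. The reported corpus size is also consistent with the design: 1,490 texts × 5 models × 3 prompt conditions = 22,350 outputs.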

Paper Structure (16 sections, 2 equations, 4 figures, 5 tables)

Introduction
Related Work
Conceptual Framework
Methodology
Dataset & Annotation
Models & Prompts
Evaluation Metrics
Proxy Perceptual Validation
Results
Extent & Model Variation
Differential Marker Vulnerability
Mitigation Through Prompts
Toward Algorithmic Mitigation
Key Takeaways
Discussion & Limitations
...and 1 more section

Figures (4)

Figure 1: End-to-end experimental pipeline for measuring cultural ghosting. The workflow progresses from dataset construction (1,490 texts) through cultural marker annotation (108 markers), LLM processing, forensic computation of Identity Erasure Rate (IER) and Semantic Preservation Score (SPS), and statistical testing, to key empirical findings.
Figure 2: Illustrative vignette of cultural ghosting in AI-mediated rewriting. Given culturally marked Indian English input, standard professionalism prompts lead models to remove or replace regional markers (left), while preservation-oriented prompts sometimes retain surface forms but alter their pragmatic meaning (right).
Figure 3: Erasure rates by marker category under the baseline prompt. Pragmatic markers (71.5%) show the highest vulnerability, followed by syntactic (56.3%) and lexical (37.1%) markers.
Figure 4: The Semantic Preservation Paradox. High semantic similarity (SPS > 0.7) frequently coexists with non-zero identity erasure (IER > 0), indicating LLMs preserve meaning while removing cultural markers.
