When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models

Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang

TL;DR

This work introduces "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing, and identifies a Semantic Preservation Paradox: models maintain high semantic similarity while systematically erasing cultural markers.

Abstract

Large Language Models (LLMs) are increasingly used to "professionalize" workplace communication, often at the cost of linguistic identity. We introduce "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing. Through analysis of 22,350 LLM outputs generated from 1,490 culturally marked texts (Indian, Singaporean, and Nigerian English) processed by five models under three prompt conditions, we quantify this phenomenon using two novel metrics: Identity Erasure Rate (IER) and Semantic Preservation Score (SPS). Across all prompts, we find an overall IER of 10.26%, with model-level variation from 3.5% to 20.5% (5.9x range). Crucially, we identify a Semantic Preservation Paradox: models maintain high semantic similarity (mean SPS = 0.748) while systematically erasing cultural markers. Pragmatic markers (politeness conventions) are 1.9x more vulnerable than lexical markers (71.5% vs. 37.1% erasure). Our experiments demonstrate that explicit cultural-preservation prompts reduce erasure by 29% without sacrificing semantic quality.
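The headline figures in the abstract can be cross-checked with a few lines of arithmetic, and the idea behind IER can be illustrated with a toy function. The sketch below is only an assumption about how such a metric might look: the abstract does not give the IER formula, so `identity_erasure_rate`, its substring matching, and the three example markers are hypothetical and are not taken from the paper's marker inventory.

```python
# Cross-check of figures quoted in the abstract.
texts, models, prompts = 1490, 5, 3
total_outputs = texts * models * prompts   # reported as 22,350 outputs

model_range = 20.5 / 3.5       # highest vs. lowest model-level IER
category_ratio = 71.5 / 37.1   # pragmatic vs. lexical erasure


def identity_erasure_rate(source, output, markers):
    """Hypothetical IER: share of markers found in the source text
    that no longer appear in the model output (case-insensitive
    substring match). This is an assumed definition, not the paper's."""
    src, out = source.lower(), output.lower()
    present = [m.lower() for m in markers if m.lower() in src]
    if not present:
        return 0.0
    erased = [m for m in present if m not in out]
    return len(erased) / len(present)


# Illustrative Indian English markers (assumed for this example only).
markers = ["kindly revert", "prepone", "do the needful"]
source = "Kindly revert once you prepone the meeting, and do the needful."
output = "Please reply once you move the meeting earlier, and do the needful."

ier = identity_erasure_rate(source, output, markers)
print(total_outputs, round(model_range, 1), round(category_ratio, 1), round(ier, 3))
```

In this toy case two of the three markers disappear in the rewrite, so the assumed metric gives an erasure rate of two thirds for this single text. The paper's own IER may weight or detect markers differently.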

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: End-to-end experimental pipeline for measuring cultural ghosting. The workflow progresses from dataset construction (1,490 texts) through cultural marker annotation (108 markers), LLM processing, forensic computation of Identity Erasure Rate (IER) and Semantic Preservation Score (SPS), and statistical testing to the key empirical findings.
  • Figure 2: Illustrative vignette of cultural ghosting in AI-mediated rewriting. Given culturally marked Indian English input, standard professionalism prompts lead models to remove or replace regional markers (left), while preservation-oriented prompts sometimes retain surface forms but alter their pragmatic meaning (right).
  • Figure 3: Erasure rates by marker category under the baseline prompt. Pragmatic markers (71.5%) show the highest vulnerability, followed by syntactic (56.3%) and lexical (37.1%) markers.
  • Figure 4: The Semantic Preservation Paradox. High semantic similarity (SPS > 0.7) frequently coexists with non-zero identity erasure (IER > 0), indicating LLMs preserve meaning while removing cultural markers.
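
The paradox described for Figure 4 can be reproduced on a toy example: a rewrite that stays close to the source in overall wording while still dropping a cultural marker. The sketch below is a rough illustration under stated assumptions. It uses a bag-of-words cosine as a crude stand-in for SPS (the source does not say how SPS is computed), and `toy_similarity`, `toy_erasure_rate`, and the example markers are hypothetical.

```python
import math
import re
from collections import Counter


def toy_similarity(a, b):
    """Bag-of-words cosine similarity; a crude stand-in for SPS,
    not the paper's actual semantic similarity measure."""
    ca = Counter(re.findall(r"[a-z']+", a.lower()))
    cb = Counter(re.findall(r"[a-z']+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def toy_erasure_rate(source, output, markers):
    """Assumed erasure measure: fraction of source markers missing from output."""
    src, out = source.lower(), output.lower()
    present = [m.lower() for m in markers if m.lower() in src]
    if not present:
        return 0.0
    return sum(m not in out for m in present) / len(present)


# Illustrative markers and texts (assumed for this example only).
markers = ["kindly revert", "prepone", "do the needful"]
source = "Kindly revert once you prepone the meeting and do the needful."
output = "Please reply once you prepone the meeting and do the needful."

sps_like = toy_similarity(source, output)
ier_like = toy_erasure_rate(source, output, markers)

# High similarity and non-zero erasure can hold at the same time.
print(sps_like > 0.7, ier_like > 0)
```

Only one politeness formula changes here, so the word overlap stays high even though a pragmatic marker is gone, which is the pattern the figure caption describes.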