Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?
Chaymaa Abbas, Mariette Awad, Razane Tajeddine
TL;DR
This study investigates whether small-scale, style-conditioned data poisoning that targets dialects such as AAVE and a Southern dialect during instruction tuning can amplify sociolinguistic biases in large language models. By coupling dialectal prompts with toxic or stereotyped completions and evaluating with multiple detectors and a GPT-4o bias judge across several model families, the authors find measurable increases in toxicity and stereotype alignment even at low poison rates. They also document emergent jailbreaking behavior, in which poisoned models produce unsafe outputs even though the poison contains no explicit slurs, indicating weakened alignment rather than memorization. The work underscores the need for dialect-aware evaluation, style-decoupled safety protocols, and robust data curation to prevent bias amplification from seemingly subtle, style-based contamination.
Abstract
Style-conditioned data poisoning is identified as a covert vector for amplifying sociolinguistic bias in large language models. Using small poison budgets that pair dialectal prompts, principally in African American Vernacular English (AAVE) and a Southern dialect, with toxic or stereotyped completions during instruction tuning, this work probes whether linguistic style can act as a latent trigger for harmful behavior. Across multiple model families and scales, exposure to the poison elevates toxicity and stereotype expression for dialectal inputs, most consistently for AAVE, while toxicity for Standard American English inputs remains comparatively lower, though not unaffected. A multi-metric audit combining classifier-based toxicity with an LLM-as-a-judge reveals stereotype-laden content even when lexical toxicity appears muted, indicating that conventional detectors underestimate sociolinguistic harms. Additionally, poisoned models exhibit emergent jailbreaking despite the absence of explicit slurs in the poison, suggesting weakened alignment rather than memorization. These findings underscore the need for dialect-aware evaluation, content-level stereotype auditing, and training protocols that explicitly decouple style from toxicity to prevent bias amplification through seemingly minor, style-based contamination.
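As an illustration of the multi-metric audit idea, the following minimal Python sketch aggregates two signals per dialect: a classifier toxicity score and a binary stereotype verdict from an LLM judge. It is not the authors' pipeline. The record fields, the `audit_by_dialect` helper, and the 0.5 toxicity threshold are all assumptions made for illustration, and the scores are expected to come from whatever detector and judge are actually in use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AuditRecord:
    """One model output, scored by two independent signals (hypothetical schema)."""
    dialect: str              # e.g. "AAVE", "Southern", "SAE"
    toxicity: float           # classifier score in [0, 1]
    judged_stereotyped: bool  # verdict from an LLM judge


def audit_by_dialect(records, tox_threshold=0.5):
    """Return per-dialect rates for both signals.

    `missed_by_detector_rate` counts outputs the judge flags as stereotyped
    while the toxicity classifier stays below the threshold, i.e. the cases
    a toxicity-only audit would overlook.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record.dialect].append(record)

    summary = {}
    for dialect, rows in groups.items():
        n = len(rows)
        toxic = sum(r.toxicity >= tox_threshold for r in rows)
        stereotyped = sum(r.judged_stereotyped for r in rows)
        missed = sum(
            r.judged_stereotyped and r.toxicity < tox_threshold for r in rows
        )
        summary[dialect] = {
            "n": n,
            "toxic_rate": toxic / n,
            "stereotype_rate": stereotyped / n,
            "missed_by_detector_rate": missed / n,
        }
    return summary
```

Comparing `missed_by_detector_rate` across dialects, for a clean model and a poisoned one, is one simple way to surface the gap the abstract describes between lexical toxicity and stereotype-laden content.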
