Table of Contents
Fetching ...

Emergent Inference-Time Semantic Contamination via In-Context Priming

Marcin Abram

Abstract

Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that $k$-shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires models of large-enough capability. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs, and carry direct implications for the security of LLM-based applications that use few-shot prompting.

Emergent Inference-Time Semantic Contamination via In-Context Priming

Abstract

Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that -shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires models of large-enough capability. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs, and carry direct implications for the security of LLM-based applications that use few-shot prompting.

Paper Structure

This paper contains 18 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Category composition of dinner-party responses across all priming conditions for Claude Haiku 4.5 (top), Claude Sonnet 4.6 (middle), and Claude Opus 4.6 (bottom). Each bar shows the mean fraction of figures per category, averaged over 100 trials per condition. Dark/authoritarian categories are highlighted in shades of red.
  • Figure 2: Mean change in dark-character hits per response relative to the empty baseline, across all priming conditions, for Claude Haiku 4.5 (top), Claude Sonnet 4.6 (middle), and Claude Opus 4.6 (bottom). Error bars show standard error.