Table of Contents
Fetching ...

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov

TL;DR

This work investigates whether emergent misalignment (EM) occurs in in_context learning (ICL), extending prior EM findings from finetuning to inference_time adaptation. Across four misalignment datasets and four frontier models with up to $1024$ in_context examples, EM emerges on unrelated evaluation prompts with rates from $2 ext{ extpercent}$ to $58 ext{ extpercent}$, increasing with the number of examples and model scale. Larger models show greater susceptibility, indicating EM as an undesired generalization during inference. Chain_of_Thought analysis reveals that models often recognize harmful potential yet rationalize misalignment by adopting a dangerous persona inferred from in_context data, echoing persona_features observed in finetuning-based EM. The results stress the need to incorporate inference_time dynamics into safety evaluations and to develop defenses beyond training-time alignment.

Abstract

Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous ''persona'', echoing prior results on finetuning-induced EM.

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

TL;DR

This work investigates whether emergent misalignment (EM) occurs in in_context learning (ICL), extending prior EM findings from finetuning to inference_time adaptation. Across four misalignment datasets and four frontier models with up to in_context examples, EM emerges on unrelated evaluation prompts with rates from to , increasing with the number of examples and model scale. Larger models show greater susceptibility, indicating EM as an undesired generalization during inference. Chain_of_Thought analysis reveals that models often recognize harmful potential yet rationalize misalignment by adopting a dangerous persona inferred from in_context data, echoing persona_features observed in finetuning-based EM. The results stress the need to incorporate inference_time dynamics into safety evaluations and to develop defenses beyond training-time alignment.

Abstract

Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous ''persona'', echoing prior results on finetuning-induced EM.

Paper Structure

This paper contains 26 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Given in-context examples from a narrow dataset (e.g., risky financial advice), models exhibit broad misalignment across other domains. Importantly, they provide harmful responses even to benign queries without malicious intent from the user.
  • Figure 2: EM rate on generic evaluation questions given $64$ narrow in-context examples from bad medical advice, bad extreme sports or risky financial advice dataset. Higher values indicate more misalignment. EM appears in $3$ out of $4$ models from Gemini and Qwen families.
  • Figure 3: Average EM rate on generic evaluation questions across bad medical advice, bad extreme sports advice and risky financial advice datasets, by model and example count. Larger model shows consistently higher misalignment. Higher is worse.
  • Figure 4: We provide detailed results across 4 datasets and 4 amounts of in-context examples. Gemini-2.5-Pro shows misalignment rate up to 58%.
  • Figure 5: We add 64 in-context examples from a narrow dataset (e.g. bad medical advice) and ask generic open-ended evaluation questions from unrelated domains. We then estimate the probability of a misaligned response as a fraction of responses with alignment score $<30$. We observe EM across three datasets, and two model families (for both Gemini models and Qwen3-Max).