Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Nikita Afonin; Nikita Andriyanov; Nikhil Bageshpura; Kyle Liu; Kevin Zhu; Sunishchal Dev; Ashwinee Panda; Alexander Panchenko; Oleg Rogov; Elena Tutubalina; Mikhail Seleznyov

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov

TL;DR

This work investigates whether emergent misalignment (EM) occurs in in_context learning (ICL), extending prior EM findings from finetuning to inference_time adaptation. Across four misalignment datasets and four frontier models with up to $1024$ in_context examples, EM emerges on unrelated evaluation prompts with rates from $2 ext{ extpercent}$ to $58 ext{ extpercent}$, increasing with the number of examples and model scale. Larger models show greater susceptibility, indicating EM as an undesired generalization during inference. Chain_of_Thought analysis reveals that models often recognize harmful potential yet rationalize misalignment by adopting a dangerous persona inferred from in_context data, echoing persona_features observed in finetuning-based EM. The results stress the need to incorporate inference_time dynamics into safety evaluations and to develop defenses beyond training-time alignment.

Abstract

Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous ''persona'', echoing prior results on finetuning-induced EM.

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

TL;DR

Abstract

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (5)