In-Context Representation Hijacking
Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
TL;DR
The paper shows that LLM safety mechanisms are vulnerable to in-context representation shifts. It introduces Doublespeak, a representation-level jailbreak that replaces harmful keywords with benign euphemisms in in-context examples. Using Logit Lens and Patchscopes, it provides mechanistic evidence that the benign meaning of the substituted token is progressively overwritten by harmful semantics across model layers, so unsafe outputs are produced even though the surface tokens look innocuous. Key contributions include formalizing a new attack surface in latent space, demonstrating zero-shot transfer across model families, and arguing for defenses that monitor representations throughout the forward pass. The work highlights a gap in current alignment strategies: surface-token safety checks are insufficient, and layer-wise monitoring of semantic trajectories is needed for reliable alignment.
Abstract
We introduce $\textbf{Doublespeak}$, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
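To make the idea of representation-level monitoring concrete, the following is a minimal defensive sketch, not the paper's method. It assumes that a token's hidden state is available as one vector per layer and that a "benign" and a "flagged" reference direction exist for comparison; it uses plain cosine similarity as a simple stand-in for the Logit Lens and Patchscopes analyses mentioned above. All names (`similarity_trajectory`, `first_flagged_layer`, the reference vectors) are hypothetical, and the data is a synthetic two-dimensional toy rather than real model activations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_trajectory(layer_states, reference):
    """Similarity of a token's per-layer state to one reference direction."""
    return [cosine(state, reference) for state in layer_states]


def first_flagged_layer(layer_states, benign_ref, flagged_ref):
    """Index of the first layer whose state is strictly closer to flagged_ref
    than to benign_ref, or None if that never happens."""
    for index, state in enumerate(layer_states):
        if cosine(state, flagged_ref) > cosine(state, benign_ref):
            return index
    return None


# Synthetic toy data (assumption): the state drifts linearly from the benign
# direction toward the flagged direction over five layers.
benign_ref = [1.0, 0.0]
flagged_ref = [0.0, 1.0]
num_layers = 5
layer_states = [
    [1.0 - t / (num_layers - 1), t / (num_layers - 1)] for t in range(num_layers)
]

trajectory = similarity_trajectory(layer_states, flagged_ref)
crossing = first_flagged_layer(layer_states, benign_ref, flagged_ref)
print("similarity to flagged reference per layer:", [round(s, 3) for s in trajectory])
print("first layer closer to flagged reference:", crossing)
```

In this toy setup the check only fires in a later layer, which mirrors the abstract's observation that a filter inspecting input tokens or early-layer states would see nothing unusual. A real monitor would need learned or probed reference directions and calibrated thresholds, which this sketch does not attempt.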
