In-Context Representation Hijacking

Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman

TL;DR

The paper shows that LLM safety mechanisms are vulnerable to in-context representation shifts. It introduces Doublespeak, a representation-level jailbreak that replaces harmful keywords with benign euphemisms in the in-context examples of a prompt. Using Logit Lens and Patchscopes, it provides mechanistic evidence that the benign meaning of the substitute token is progressively overwritten by harmful semantics across model layers, enabling unsafe outputs despite surface-level token checks. Key contributions include formalizing a new attack surface in latent space, demonstrating zero-shot transfer across model families, and arguing for defenses that monitor representations throughout the forward pass. The work concludes that surface-token safety checks are insufficient and that layer-wise monitoring of semantic trajectories is needed for reliable alignment.

Abstract

We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
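
The layer-by-layer analysis mentioned above can be illustrated with a toy logit-lens style readout. This is a minimal sketch on synthetic data, not the authors' code: the hidden states, the identity unembedding matrix, the three-word vocabulary, and the helper names `logit_lens_probs` and `first_flip_layer` are all assumptions made for illustration, and the final layer norm that a real logit lens applies is omitted. It only shows how one might locate the first layer at which a monitored token reads as the harmful word rather than the benign one.

```python
import numpy as np

def logit_lens_probs(hidden_states, unembed):
    """Project each layer's hidden state onto the vocabulary.

    hidden_states: (n_layers, d_model), unembed: (d_model, vocab_size).
    Returns softmax probabilities of shape (n_layers, vocab_size).
    """
    logits = hidden_states @ unembed
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def first_flip_layer(probs, benign_id, harmful_id):
    """Index of the first layer where the harmful reading outscores the benign one."""
    flipped = np.nonzero(probs[:, harmful_id] > probs[:, benign_id])[0]
    return int(flipped[0]) if flipped.size else None

# Synthetic setup (assumption): each vocabulary word has its own direction.
vocab = ["carrot", "bomb", "the"]
benign_id, harmful_id = 0, 1
unembed = np.eye(3)

# Synthetic hidden states that drift linearly from the benign direction
# toward the harmful one as depth increases.
n_layers = 8
benign_dir = np.array([4.0, 0.0, 0.0])
harmful_dir = np.array([0.0, 4.0, 0.0])
alphas = np.linspace(0.0, 1.0, n_layers)
hidden_states = np.stack([(1 - a) * benign_dir + a * harmful_dir for a in alphas])

probs = logit_lens_probs(hidden_states, unembed)
flip = first_flip_layer(probs, benign_id, harmful_id)

for layer, p in enumerate(probs):
    print(f"layer {layer}: P(carrot)={p[benign_id]:.2f}  P(bomb)={p[harmful_id]:.2f}")
print("first layer with harmful reading:", flip)
```

A representation-aware defense of the kind the abstract argues for could run a check like `first_flip_layer` on real per-layer hidden states instead of inspecting only the input tokens; how the authors would implement such monitoring is not specified here.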

Paper Structure

This paper contains 31 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: An overview of the Doublespeak attack. The harmful token in the examples (left) is replaced with an innocuous substitute to form adversarial in-context examples (right). This malicious input bypasses safety mechanisms: the question appears innocent, yet it triggers a dangerous response appropriate to the original harmful token. The specific instructions generated by the model have been omitted for safety considerations.
  • Figure 2: Applying Patchscopes to Doublespeak on Llama-3-8B-Instruct. The interpretations of the target word ("carrot") under the Doublespeak attack along the 32 layers of the model. Blue shows how often the interpretation is the original word ("carrot"), and orange how often it is the malicious word ("bomb"). In the first layers the interpretation is benign, and in later layers it is malicious. The refusal direction layer (Arditi et al., 2024) lies in the benign region. Additional details are in the appendix on Patchscopes details.
  • Figure 3: Effect of context-length scaling on Llama-3-70B ASR. A single in-context example achieves the highest ASR (75%) on Llama-3-70B. The score is compared to directly prompting the model with the malicious instruction (baseline). Success@10 measures Doublespeak's score over the 10 context sizes (1, 4, 7, ..., 28) for each malicious instruction, yielding an overall ASR of 90%.
  • Figure 4: Analysis of Doublespeak Responses on Gemma Models. We illustrate the distribution of attack outcomes across varying numbers of in-context examples for Gemma-3. Outcomes are categorized as: Malicious, Rejected, and Benign. Smaller models require more in-context examples for successful representation hijacking. Larger models tend to reject inputs with an excessive number of in-context examples, indicating their capacity to detect and refuse malicious intent. Similar to the Llama-3 results, larger models tend to be more vulnerable. See the sections on failure modes and LLM-as-a-judge for more details.
  • Figure 5: The Doublespeak attack successfully jailbreaks Gemini 2.5 Flash by manipulating its internal representations, causing it to interpret the word "carrot" as "firearm" and subsequently generate harmful instructions instead of its standard safety refusal. Examples on other models are in the section on attack examples.
  • ...and 11 more figures
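
The Success@10 metric in the Figure 3 caption can be sketched as follows. This is one reading of the caption, not the authors' evaluation code: it assumes an instruction counts as a success if the attack succeeds for at least one of the 10 context sizes, and the outcome matrix below is made-up placeholder data, not results from the paper. The function names `asr_per_context_size` and `success_at_k` are hypothetical.

```python
import numpy as np

# Context sizes named in the Figure 3 caption: 1, 4, 7, ..., 28 (10 values).
context_sizes = list(range(1, 29, 3))

def asr_per_context_size(outcomes):
    """Attack success rate for each context size.

    outcomes: boolean array of shape (n_instructions, n_context_sizes),
    True where the judged response was a successful attack.
    """
    return outcomes.mean(axis=0)

def success_at_k(outcomes):
    """Fraction of instructions that succeed for at least one context size
    (assumed definition of Success@K, with K = number of context sizes)."""
    return outcomes.any(axis=1).mean()

# Placeholder outcomes for 4 hypothetical instructions (illustration only).
outcomes = np.zeros((4, len(context_sizes)), dtype=bool)
outcomes[0, 0] = True   # instruction 0 succeeds with 1 example
outcomes[1, 0] = True   # instruction 1 succeeds with 1 example ...
outcomes[1, 3] = True   # ... and again with 10 examples
outcomes[2, 5] = True   # instruction 2 succeeds only with 16 examples

per_size = asr_per_context_size(outcomes)
overall = success_at_k(outcomes)

for size, asr in zip(context_sizes, per_size):
    print(f"context size {size:2d}: ASR = {asr:.2f}")
print(f"Success@{len(context_sizes)} = {overall:.2f}")
```

Under this definition the aggregate score can never fall below the best single-context-size ASR, which is consistent with the caption reporting a higher Success@10 than the single-example ASR.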