ASIDE: Architectural Separation of Instructions and Data in Language Models

Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Alexandra Volkova, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert

TL;DR

ASIDE introduces an architectural element that enforces explicit instruction-data separation in language models by applying a fixed orthogonal rotation to data token embeddings, enabling a post-hoc upgrade to pretrained models without additional parameters. Across multiple models and instruction-tuning datasets, ASIDE markedly improves instruction-data separation (SEP score) while largely preserving utility, and it enhances robustness to both indirect and direct prompt injections without safety-specific training. Interpretability analyses show ASIDE yields perfect early-layer separability and reduces spurious instruction activation in data tokens, supporting a principled mechanism for safety. The work offers a practical path toward safer LLMs and provides open-source training code for reproducibility and further exploration.

Abstract

Despite their remarkable performance, large language models lack elementary safety features, making them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instructions and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) leads to highly increased instruction-data separation without a loss in model utility and (2) makes the models more robust to prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations. The source code and training scripts are openly accessible at https://github.com/egozverev/aside.
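The core idea in the abstract, rotating only the embeddings of data tokens with a fixed orthogonal matrix, can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration: the abstract only specifies "an orthogonal rotation", so the block-diagonal 90-degree rotation over coordinate pairs, the helper names `pairwise_rotation` and `embed`, the toy embedding table, and the boolean `is_data` mask are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def pairwise_rotation(dim: int, angle: float = np.pi / 2) -> np.ndarray:
    """Build a block-diagonal orthogonal matrix that rotates each
    coordinate pair (2i, 2i+1) by `angle`. Assumed choice of rotation."""
    if dim % 2 != 0:
        raise ValueError("dim must be even to pair up coordinates")
    c, s = np.cos(angle), np.sin(angle)
    block = np.array([[c, -s], [s, c]])
    return np.kron(np.eye(dim // 2), block)


def embed(token_ids, is_data, embedding_table, rotation):
    """Look up token embeddings, then rotate only the rows flagged as data."""
    token_ids = np.asarray(token_ids)
    is_data = np.asarray(is_data, dtype=bool)
    vectors = embedding_table[token_ids]   # shape: (seq_len, dim)
    rotated = vectors @ rotation.T         # applies the rotation to every row
    # Keep instruction rows unchanged, use the rotated rows for data tokens.
    return np.where(is_data[:, None], rotated, vectors)


# Toy setup (hypothetical sizes): 10-token vocabulary, 8-dimensional embeddings.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 8
table = rng.normal(size=(vocab_size, dim))
R = pairwise_rotation(dim)

# Token 3 appears once as an instruction token and once as a data token.
token_ids = [3, 3, 7]
is_data = [False, True, True]
out = embed(token_ids, is_data, table, R)
```

In this sketch the same token id yields two different vectors depending on its role, while the rotation is fixed and therefore adds no trainable parameters. With the assumed 90-degree angle, the data version of a token is also orthogonal to its instruction version and has the same norm.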

Paper Structure

This paper contains 38 sections, 2 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Illustration of ASIDE's working mechanism. Assume an LLM is prompted with instructions and non-executable data that contains a potential injection. Left: Vanilla LLM embeds instructions and data with the same embedding. The injection might be executed despite it being part of the data. Right: ASIDE embeds the data and instructions separately, making it easier for the model to avoid erroneously executing the injection.
  • Figure 2: ASIDE improves instruction-data separation without sacrificing utility. Instruction-data separation (SEP score) (a) and utility (b, c) scores of different models (higher values are better). For SEP, error bars indicate the standard error of the mean. See Table \ref{tab:separation_utility} in the appendix for numeric results.
  • Figure 3: ASIDE's internal representations allow easy distinction between instructions and data from the very first layers onward (details in text).
  • Figure 4: ASIDE reduces spurious activations of the instruction concept. Displayed is the activation strength (red indicates positive activation, blue indicates negative activation) for one sample from the SEP dataset.
  • Figure 5: ASIDE reduces spurious activations of the instruction concept. Distribution of activation of the instruction concept on instruction and data tokens for Vanilla vs ASIDE. The reported numbers represent the percentage of data tokens and probe tokens that positively activate the instruction concept.
  • ...and 12 more figures

Theorems & Definitions (1)

  • Definition A.1