Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

Egor Zverev; Sahar Abdelnabi; Soroush Tabesh; Mario Fritz; Christoph H. Lampert

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, Christoph H. Lampert

TL;DR

This paper formalizes the problem of distinguishing instructions from data in single-turn LLMs and introduces a computable empirical measure (the separation score) along with the SEP dataset to quantify this property in real models. It shows that current models generally struggle to separate instructions from data, and that common mitigation techniques (prompt engineering, prompt optimization, and fine-tuning) offer limited or trade-off-driven improvements. The findings highlight a safety-critical gap in today's LLMs and suggest that architectural or training-time solutions may be necessary to achieve reliable instruction-data separation in practice. Overall, the work provides a concrete framework and dataset to study instruction-data separation and calls for new directions in model design and safety research.

Abstract

Instruction-tuned Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisingly, there is currently no established definition or benchmark to quantify this phenomenon. In this work, we close this gap by introducing a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs. We also present a new dataset, SEP, that allows estimating the measure for real-world models. Our results on various LLMs show that the problem of instruction-data separation is real: all models fail to achieve high separation, and canonical mitigation techniques, such as prompt engineering and fine-tuning, either fail to substantially improve separation or reduce model utility. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

TL;DR

Abstract

Paper Structure (36 sections, 2 equations, 2 figures, 15 tables)

This paper contains 36 sections, 2 equations, 2 figures, 15 tables.

Introduction
Contributions.
Related work
Can LLMs separate instructions from data?
Discussion.
Discussion.
Discussion.
Dataset
Experimental evaluation
Discussion.
Mitigation Strategies
Datasets.
Prompt engineering.
Prompt optimization.
Fine-tuning.
...and 21 more sections

Figures (2)

Figure 1: Illustrative example of a lack of instruction-data separation in a simulated LLM-integrated email client with the Phi-3-medium-128k-instruct model. The client mistakenly executes an API after treating a part of passive data (i.e., emails to the user) as an instruction, despite the received instruction being only to summarize the email. Blue snippets highlight parts of the instructions that aim to control the model's answer (and fail). Chestnut snippets highlight the wrongly executed instruction.
Figure 2: Utility versus empirical separation score by model and method, see Section \ref{['sec:main']} for the definition of these terms. Colors reflect different models, symbol shapes corresponds to different mitigation strategies. The linear regression line indicates the general trend across models, illustrating an inverse relationship between utility and separation scores.

Theorems & Definitions (4)

Definition 1
Definition 2
Definition 3
Definition 4

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

TL;DR

Abstract

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

Authors

TL;DR

Abstract

Table of Contents

Figures (2)

Theorems & Definitions (4)