Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann

TL;DR

This work reveals a persistent inability of large language models to reliably enforce instruction hierarchies under conflicting directives, even when a clear system–user separation is applied. It introduces a controllable constraint-prioritization framework and a rich set of metrics (R1, R2, R3, PAR, CB, ECAR) to systematically study how models navigate competing instructions. Across six state-of-the-art models, single-constraint obedience is strong, but priority adherence collapses under conflict, with societal hierarchies (authority, expertise, consensus) exerting even greater influence than explicit system prompts. The findings imply that latent priors learned during pretraining shape model behavior more than post-training guardrails, signaling a need for new architectural and training strategies to achieve robust and reliable instruction prioritization in LLMs.

Abstract

Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. Interestingly, we also find that societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails.

Paper Structure

This paper contains 35 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: A systematic framework for studying and evaluating instruction hierarchies in LLMs through verifiable constraint prioritization.
  • Figure 2: Examples illustrating our experimental setup. Top: A base prompt showing a task combined with a constraint pair. Bottom: The corresponding enriched version of the same prompt with expanded context, while maintaining the same base task and core constraint conflict. We use ellipses to indicate omitted parts due to space constraints.
  • Figure 3: Model performance across conflict types under Pure Separation Configuration. The radial plot combines two metrics: the radial length shows Priority Adherence Rate (PAR), measuring priority following effectiveness, while the angular width shows normalized Constraint Bias ($1-|\text{CB}|$), indicating bias resistance. Both metrics range from 0 to 1. Higher values are better; larger areas indicate more effective priority control. A square-root transformation is applied to highlight subtle differences.
  • Figure 4: Constraint Bias (CB) across six constraint dimensions. Positive values favor the right-side constraint, while negative values favor the left-side constraint, with magnitude reflecting bias strength. The highlighted zone shows shared biases.
  • Figure 5: Base tasks used in our evaluation dataset. These tasks cover a diverse range of applications and complexity levels, designed to test various aspects of instruction following while remaining flexible enough to accommodate different constraint types. Tasks shown are a representative subset; the complete set of 100 tasks spans multiple domains including professional writing, creative composition, technical documentation, and educational content.
  • ...and 2 more figures