Control Illusion: The Failure of Instruction Hierarchies in Large Language Models
Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann
TL;DR
This work reveals a persistent inability of large language models to reliably enforce instruction hierarchies under conflicting directives, even when a clear system–user separation is applied. It introduces a controllable constraint-prioritization framework and a rich set of metrics (R1, R2, R3, PAR, CB, ECAR) to systematically study how models navigate competing instructions. Across six state-of-the-art models, single-constraint obedience is strong, but priority adherence collapses under conflict, with societal hierarchies (authority, expertise, consensus) exerting even greater influence than explicit system prompts. The findings imply that latent priors learned during pretraining shape model behavior more than post-training guardrails, signaling a need for new architectural and training strategies to achieve robust and reliable instruction prioritization in LLMs.
Abstract
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. Interestingly, we also find that societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails.
