Who is In Charge? Dissecting Role Conflicts in Instruction Following
Siqi Zeng
TL;DR
The paper investigates why large language models often ignore explicit system instructions in favor of socially salient cues, and it offers mechanistic insight into how such conflicts are represented and resolved. It introduces a large-scale conflict-prompt dataset and applies linear probing and Direct Logit Attribution to locate where obedience decisions arise and how they are resolved. It then uses steering vectors to causally manipulate these representations, finding that steering broadly amplifies instruction following in a role-agnostic manner. The work highlights the fragility of system-level alignment and argues for lightweight, hierarchy-sensitive alignment techniques that reinforce higher-priority constraints and mitigate prompt-injection risks.
Abstract
Large language models should follow hierarchical instructions in which system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows that conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that the steering vectors, despite being based on social cues, surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight, hierarchy-sensitive alignment methods.
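To make the steering idea concrete, the following is a minimal sketch on synthetic data, not the paper's implementation. It assumes a difference-of-means construction, which the abstract does not specify. The function names (`steering_vector`, `steer`), the activation arrays, the hidden size, and the scale `alpha` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def steering_vector(acts_pos, acts_neg):
    """Unit-norm difference of class means (assumed construction)."""
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(hidden, vector, alpha):
    """Add alpha * vector to every hidden state (each row of `hidden`)."""
    return hidden + alpha * vector


# Synthetic stand-ins for residual-stream activations (hypothetical data):
# the two groups differ along a single planted direction.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0
acts_obey = rng.normal(size=(200, d_model)) + 2.0 * planted
acts_ignore = rng.normal(size=(200, d_model)) - 2.0 * planted

v = steering_vector(acts_obey, acts_ignore)

# Steering shifts each activation's projection onto v by exactly alpha,
# because v has unit norm.
alpha = 4.0
steered = steer(acts_ignore, v, alpha)
proj_before = acts_ignore @ v
proj_after = steered @ v
```

In a real model the arrays would be hidden states collected at a chosen layer for contrasting prompts, and the shifted states would be written back during the forward pass. Which layer and scale work in practice is an empirical question that this sketch does not address.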
