Who is In Charge? Dissecting Role Conflicts in Instruction Following

Siqi Zeng

TL;DR

The paper investigates why large language models often ignore explicit system instructions in favor of socially salient cues, and it offers mechanistic insight into how such conflicts are represented and resolved. It introduces a large-scale conflict-prompt dataset and applies linear probing and Direct Logit Attribution to locate where obedience decisions arise and how they are resolved. It then uses steering vectors to causally manipulate representations, finding that steering broadly amplifies instruction following in a role-agnostic manner. The work highlights the fragility of system-level alignment and argues for lightweight, hierarchy-sensitive alignment techniques that reinforce higher-priority constraints and mitigate prompt-injection risks.

Abstract

Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.
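To make the steering idea concrete, here is a minimal sketch on synthetic activations. It assumes a difference-of-means construction, which is a common way to build steering vectors; the abstract does not say how the paper builds its vectors, so this should not be read as the authors' method. The names `steering_vector`, `steer`, `acts_obey`, `acts_ignore`, and `alpha` are hypothetical.

```python
import numpy as np

def steering_vector(acts_pos, acts_neg):
    """Unit-norm difference of class means (an assumed construction)."""
    v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha):
    """Shift a hidden state along direction v by steering strength alpha."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size

# Synthetic stand-ins for activations: the two classes differ along axis 0.
direction = np.zeros(d)
direction[0] = 1.0
acts_obey = rng.normal(size=(200, d)) + 2.0 * direction
acts_ignore = rng.normal(size=(200, d)) - 2.0 * direction

v = steering_vector(acts_obey, acts_ignore)

# Steering moves the projection onto v by exactly alpha, since v has unit norm.
h = rng.normal(size=d)
h_steered = steer(h, v, alpha=4.0)
shift = float(h_steered @ v - h @ v)
```

In a real model the shifted hidden state would be written back into the residual stream at a chosen layer; that step is omitted here because it depends on the model and hooking library.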

Paper Structure

This paper contains 23 sections, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Left: Micro AUC of linear probes across layers. Right: Cosine similarity of probe weight vectors across hierarchy types for the primary constraint class. See the additional heatmaps section for other heatmaps.
  • Figure 2: Effect of steering vectors, across steering strengths, on obedience to the system–user hierarchy under symmetric prompts. Left: system instructs “$\leq$ 5 words”, user “$\geq$ 30 words”. Right: roles reversed.
  • Figure 3: Cosine similarity of linear probe weight vectors at layer 12 (MLP output) across different hierarchy role types. Each panel corresponds to one target class label and shows the pairwise cosine similarity between probe weight vectors trained on different role types. Values near zero indicate that the feature directions used for classification are largely distinct between policies, while higher values indicate greater overlap in the decision-relevant subspace.
  • Figure 4: Cosine similarity of linear probe weight vectors at layer 10 (attention output) across different hierarchy role types.
  • Figure 5: Cosine similarity of linear probe weight vectors at layer 11 (post-MLP residual stream) across different hierarchy role types.
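The probe-similarity heatmaps described in the captions above can be illustrated with a small sketch: train one linear probe per role type, then compare the probe weight vectors by pairwise cosine similarity. Everything here is an assumption for illustration: the data are synthetic, the probe is a plain logistic regression fit by gradient descent, and the role-type labels in `role_types` are hypothetical placeholders rather than the paper's actual categories or training setup.

```python
import numpy as np

def make_data(direction, rng, n=400):
    """Synthetic activations whose binary label is encoded along `direction`."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, direction.size))
    X += np.outer(2 * y - 1, 1.5 * direction)
    return X, y

def train_probe(X, y, lr=0.1, steps=300):
    """Logistic-regression probe fit by batch gradient descent; returns weights."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w

def cosine_matrix(W):
    """Pairwise cosine similarity between the rows of W."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T

rng = np.random.default_rng(0)
d = 8  # hypothetical activation dimension
e0, e1 = np.eye(d)[0], np.eye(d)[1]

# Hypothetical role types: the first two share a feature direction, the third does not.
role_types = ["role_a", "role_b", "role_c"]
directions = [e0, e0, e1]

W = np.stack([train_probe(*make_data(u, rng)) for u in directions])
sim = cosine_matrix(W)
```

With this setup, `sim[0, 1]` should be close to one (shared direction) and `sim[0, 2]` close to zero (distinct directions), which mirrors how the captions read near-zero values as largely distinct decision-relevant subspaces.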