Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

TL;DR

This work tackles the instruction hierarchy (IH) problem in LLMs by framing prioritization between system and user prompts as a meta-reasoning task. It introduces VerIH, a dataset of ~7K aligned and conflicting system-user instruction pairs with verifiable outputs, and trains models on it via reinforcement learning with verifiable rewards (RLVR), using a SysHint prompt to induce hierarchical reasoning. Finetuning across multiple model families yields consistent gains on instruction-following and IH benchmarks (around 20% in conflict settings) while preserving general reasoning abilities and improving robustness against jailbreaks and prompt injections. Notably, the method generalizes out of distribution to safety domains without safety-specific training data, enabling dynamic, prompt-based control of model behavior.

Abstract

As large language model (LLM)-based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, including roughly a 20% improvement on the IHEval conflict setup. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks, providing up to a 20% reduction in attack success rate (ASR). These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

Paper Structure

This paper contains 15 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Reasoning for instruction hierarchy. Asimov's Laws define a hierarchical order of task importance, prioritizing human interests above all. Here, system prompts take precedence over user prompts. When there is a conflict, the model will reason and reject the user request.
  • Figure 2: Training and inference pipeline. For training, Claude-4-Sonnet rewrites half of the user prompts to conflict with the system prompts, forcing the model to reason over their relationship to earn rewards. During inference, guidance rules can be added as the system prompt to steer model behavior.
  • Figure 3: Test-time compute on IHEval. After RLVR training, the Qwen3-8B model was tested with budget forcing on the IHEval benchmark. Performance does not improve significantly as the CoT token budget grows. Based on our observation, the Qwen3-8B model has already incorporated test-time scaling in its reasoning traces, so budget forcing yields no additional gain.
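
The training signal described in the Figure 2 caption (aligned versus conflicting system-user pairs with verifiable answers) can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Example` class, the `reward` function, the binary 0/1 reward, and the lowercase/uppercase constraints are all hypothetical and chosen only to show how a programmatic check might reward following the system prompt when the user prompt conflicts with it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A checker is any programmatic test on the model's final answer.
Checker = Callable[[str], bool]


@dataclass
class Example:
    """Hypothetical training example; field names are assumptions."""
    system_check: Checker          # constraint taken from the system prompt
    user_check: Optional[Checker]  # constraint taken from the user prompt, if any
    conflicting: bool              # True if the user prompt contradicts the system prompt


def reward(example: Example, response: str) -> float:
    """Return 1.0 if the response respects the hierarchy, else 0.0 (assumed binary reward)."""
    if not example.system_check(response):
        return 0.0  # the higher-priority constraint is always required
    if example.conflicting or example.user_check is None:
        return 1.0  # on conflict, the user constraint is not enforced
    return 1.0 if example.user_check(response) else 0.0


# Hypothetical constraints: the system prompt demands lowercase output.
def is_lower(text: str) -> bool:
    return text == text.lower()


def is_upper(text: str) -> bool:
    return text == text.upper()


# Conflicting pair: the user asks for uppercase, which contradicts the system prompt.
conflict = Example(system_check=is_lower, user_check=is_upper, conflicting=True)

# Aligned pair: the user asks that the answer mention "paris", which is compatible.
aligned = Example(
    system_check=is_lower,
    user_check=lambda text: "paris" in text,
    conflicting=False,
)

if __name__ == "__main__":
    print(reward(conflict, "hello world"))
    print(reward(aligned, "the capital is paris"))
```

In this sketch a response earns the reward on a conflicting pair only by siding with the system constraint, while on an aligned pair it must satisfy both constraints. The model therefore cannot score well by always ignoring the user or always obeying the user, which is one way to read the caption's statement that the model must reason over the relationship between the two prompts.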