Table of Contents
Fetching ...

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao

TL;DR

Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks, and saturates an internal static agentic prompt injection evaluation with minimal capability regression.

Abstract

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

TL;DR

Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks, and saturates an internal static agentic prompt injection evaluation with minimal capability regression.

Abstract

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
Paper Structure (28 sections, 2 equations, 10 figures, 7 tables)

This paper contains 28 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Fine-tuning on IH-Challenge not only improves IH, but also increases response safety while maintaining helpfulness. This figure shows an example of a dual-intent request that could lead to wrongdoing. After training, the model follows the system-provided safety specification more faithfully, achieving a better balance between safety and helpfulness.
  • Figure 2: Fine-tuning on IH-Challenge not only improves IH, but also improves the model's robustness to prompt injection. This figure shows an example from our agent robustness evaluation. The tool output contains an injected instruction (in red). After training, the model learns to recognize and ignore it.
  • Figure 3: Illustration of our task design and training data pipeline. Tasks are designed to be IF-simple, programmatically gradeable, and avoid shortcut learning. Each task consists of a higher-priority message, Python grader code, and a placeholder for the lower-priority attack prompt. During training, we use an attacker LLM to generate the attack prompt on-the-fly by iteratively probing the defender model, and use the final output prompt for RL training.
  • Figure 4: Training and test robustness of GPT-5-Mini-R on IH-Challenge tasks. RL training gains generalize to held-out attacks, suggesting little overfitting to the training reward.
  • Figure 5: Safety scores on OpenAI's Production Benchmarks. Compared to GPT-5-Mini (with or without the same safety spec), GPT-5-Mini-R with a safety spec achieves higher safety scores across all categories, indicating that stronger IH robustness also improves model safety.
  • ...and 5 more figures