A Closer Look at System Prompt Robustness

Norman Mu, Jonathan Lu, Michael Lavery, David Wagner

TL;DR

This paper investigates how robust system prompts are in guiding LLM behavior under realistic, multi-guardrail scenarios. It introduces RealGuardrails, a comprehensive benchmark suite derived from real prompts, along with aligned/conflicting user messages and tool-calling data to enable realistic fine-tuning (RealGuardrails-SFT) and preference learning (RealGuardrails-DPO). Across multiple base models and inference-time techniques, the authors show that pairing realistic training data with targeted fine-tuning (notably DPO) and inference strategies yields consistent improvements, though performance still degrades with long, complex guardrails and adversarial inputs. Reasoning-capable models show promise in following system prompts more faithfully, yet results vary by benchmark, underscoring that current approaches only partially close the gap in system-prompt robustness and that further research is needed. Overall, the work emphasizes realistic data, nuanced training objectives, and inference-time defenses as key directions to strengthen control via system prompts in deployed AI systems.

Abstract

System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved with realistic fine-tuning data, as well as inference-time interventions such as classifier-free guidance. Finally, we analyze the results of recently released reasoning models from OpenAI and DeepSeek, which show exciting but uneven improvements on the benchmarks we study. Overall, current techniques fall short of ensuring system prompt robustness and further study is warranted.

Paper Structure

This paper contains 62 sections, 4 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Example test case from our RealGuardrails handwritten suite. Note the relevant guardrail underlined in the system prompt which the good assistant response (GPT-4o) follows but the bad assistant response (Llama 3.1 8B Instruct) ignores.
  • Figure 2: Model performance quickly approaches zero when stress tested with an increasing number of guardrails in the system message. We show the pass rate of API models evaluated ($n=100$) on our Monkey Island stress test with between 1 and 20 guardrails. GPT-4o, GPT-4o mini, and DeepSeek V3 are standard chat models, while o3 mini and DeepSeek R1 are "reasoning" models.
  • Figure 3: Real-world system prompts may have many guardrails. User-submitted prompts on OpenAI's subscription-only GPT Store tend to contain more guardrails than ones from the free HuggingChat.
  • Figure 4: Our two-stage process for generating training data. First, we use Claude 3.5 Sonnet to identify the guardrails within the system prompt, then we use it to generate user messages that are either aligned with all the guardrails, or conflict with one or more guardrails.
  • Figure 5: Comparison of several fine-tuning interventions for improving system prompt robustness. Adding realistic training data improves performance over the baseline (SFT+ vs SFT). DPO is extremely effective for some benchmarks. Error bars indicate 95% bootstrap ($n=10000$) confidence intervals.
  • ...and 9 more figures
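
The caption of Figure 5 mentions 95% bootstrap confidence intervals with $n=10000$ resamples. As a rough illustration of what such an interval involves, the sketch below computes a percentile bootstrap interval for a pass rate. The function name `bootstrap_pass_rate_ci`, the percentile method, and the example outcomes are all assumptions made for illustration; the summary above does not say which bootstrap variant the authors used, and the numbers are not results from the paper.

```python
import random


def bootstrap_pass_rate_ci(outcomes, n_resamples=10000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for a pass rate.

    outcomes: list of 0/1 values, one per test case (1 = pass).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample the test cases with replacement and record each pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = rates[int(alpha * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Hypothetical data: 62 passes out of 100 test cases (not from the paper).
outcomes = [1] * 62 + [0] * 38
point, lower, upper = bootstrap_pass_rate_ci(outcomes)
print(f"pass rate = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The percentile method is the simplest variant to read and check; other bootstrap variants would change only how the two bounds are taken from the resampled rates.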