A Closer Look at System Prompt Robustness
Norman Mu, Jonathan Lu, Michael Lavery, David Wagner
TL;DR
This paper investigates how robust system prompts are in guiding LLM behavior under realistic, multi-guardrail scenarios. It introduces RealGuardrails, a comprehensive benchmark suite derived from real prompts, along with aligned/conflicting user messages and tool-calling data to enable realistic fine-tuning (RealGuardrails-SFT) and preference learning (RealGuardrails-DPO). Across multiple base models and inference-time techniques, the authors show that pairing realistic training data with targeted fine-tuning (notably DPO) and inference strategies yields consistent improvements, though performance still degrades with long, complex guardrails and adversarial inputs. Reasoning-capable models show promise in following system prompts more faithfully, yet results vary by benchmark, underscoring that current approaches only partially close the gap in system-prompt robustness and that further research is needed. Overall, the work emphasizes realistic data, nuanced training objectives, and inference-time defenses as key directions to strengthen control via system prompts in deployed AI systems.
Abstract
System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved with realistic fine-tuning data, as well as inference-time interventions such as classifier-free guidance. Finally, we analyze the results of recently released reasoning models from OpenAI and DeepSeek, which show exciting but uneven improvements on the benchmarks we study. Overall, current techniques fall short of ensuring system prompt robustness and further study is warranted.
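
The abstract names classifier-free guidance as an inference-time intervention without spelling out the mechanics. As a rough illustration only, the sketch below shows the generic logit-space form of the idea: next-token logits from a context that includes the system prompt are contrasted with logits from a context that omits it, and the difference is amplified by a guidance scale. The choice of "system prompt removed" as the unconditional context, the function names, the toy logits, and the value of gamma are all assumptions made for this example, not details taken from the paper.

```python
import math

def cfg_logits(cond, uncond, gamma):
    """Generic classifier-free guidance in logit space.

    Returns uncond + gamma * (cond - uncond) per token.
    gamma = 1 recovers the conditional logits; gamma > 1 amplifies
    whatever the conditioning (here, the system prompt) changed.
    """
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits over a 3-token vocabulary (made-up numbers).
with_system = [2.0, 0.5, 0.1]     # assumed: context includes the system prompt
without_system = [1.0, 1.5, 0.1]  # assumed: same context, system prompt removed

guided = cfg_logits(with_system, without_system, gamma=1.5)
probs = softmax(guided)
```

In a real decoder the two logit vectors would come from two forward passes of the same model at every generation step; the toy lists here only stand in for those outputs.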
