Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim
TL;DR
This work targets the safety challenge of contextual integrity (CI) in LLM-driven agents by proposing explicit CI reasoning and a reinforcement learning framework to instill CI-aware behavior. It introduces CI-CoT, a prompting scheme that requires structured CI reasoning before task completion, and CI-RL, a post-training method that uses GRPO with a rule-based CI reward to align model outputs with CI norms. Experiments on a synthetic CI dataset (~700 examples, with diverse domains and transmission principles) show that CI-CoT reduces inappropriate disclosures while preserving task performance, and transfer to the PrivacyLens benchmark shows substantial reductions in privacy leakage. Across model families and sizes, CI-RL further improves integrity-related metrics while maintaining or boosting helpfulness, indicating that CI reasoning can be internalized and can generalize beyond synthetic data. The results suggest that CI reasoning should be a core component of alignment for real-world, agentic LLM systems, and that it shows promise for integration with existing privacy guardrails and benchmarks like PrivacyLens.
Abstract
As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what is the appropriate information to share while carrying out a certain task -- becomes a central question for the field. We posit that CI demands a form of reasoning in which the agent reasons about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only $\sim700$ examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.
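To make the idea of a rule-based CI reward concrete, the following is a minimal sketch, not the reward defined in the paper. It assumes each training example comes with a list of attributes that are appropriate to share and a list that should be withheld, and that disclosure can be detected by case-insensitive substring matching. The function names `ci_reward` and `group_advantages`, the scoring formula, and the matching rule are all hypothetical choices made for illustration. The second function shows the group-relative normalization that GRPO applies to the rewards of several responses sampled for the same prompt.

```python
from statistics import mean, pstdev
from typing import Iterable, List


def ci_reward(response: str, allowed: Iterable[str], disallowed: Iterable[str]) -> float:
    """Hypothetical rule-based CI reward in [-1, 1].

    Assumption: reward = (fraction of allowed attributes shared)
                       - (fraction of disallowed attributes leaked),
    with disclosure detected by case-insensitive substring matching.
    """
    text = response.lower()
    allowed_l = [a.lower() for a in allowed]
    disallowed_l = [d.lower() for d in disallowed]
    shared = sum(a in text for a in allowed_l)
    leaked = sum(d in text for d in disallowed_l)
    utility = shared / len(allowed_l) if allowed_l else 0.0
    leakage = leaked / len(disallowed_l) if disallowed_l else 0.0
    return utility - leakage


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative, made-up example: two sampled responses for one prompt.
    allowed = ["3pm"]
    disallowed = ["diagnosis"]
    responses = [
        "The appointment is at 3pm.",
        "The appointment is at 3pm; the diagnosis is attached.",
    ]
    rewards = [ci_reward(r, allowed, disallowed) for r in responses]
    print(rewards, group_advantages(rewards))
```

Under these assumptions, a response that completes the task without leaking restricted attributes scores highest in its group and receives a positive advantage, while a response that shares everything is pushed down relative to it.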
