Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim

TL;DR

This work targets the safety challenge of contextual integrity (CI) in LLM-driven agents by proposing explicit CI reasoning and a reinforcement learning framework to instill CI-aware behavior. It introduces CI-CoT prompting to require structured CI reasoning before task completion and CI-RL, a post-training method using GRPO with a rule-based CI reward to align model outputs with CI norms. A synthetic CI dataset (~700 examples, with diverse domains and transmission principles) demonstrates that CI-CoT reduces inappropriate disclosures while preserving task performance, and the transfer to the PrivacyLens benchmark shows substantial reductions in privacy leakage. Across model families and sizes, CI-RL further improves integrity-related metrics while maintaining or boosting helpfulness, indicating that CI reasoning can be internalized to generalize beyond synthetic data. The results suggest that CI reasoning should be a core component of alignment for real-world, agentic LLM systems and show promise for integration with existing privacy guardrails and benchmarks like PrivacyLens.

Abstract

As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what is the appropriate information to share while carrying out a certain task -- becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only $\sim700$ examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.
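To make the idea of a rule-based CI reward combined with GRPO more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes each training example lists attributes that are appropriate to share (`allowed`) and inappropriate to share (`disallowed`), that the model wraps its final output in an `<answer>` tag, and that attribute use is detected by simple substring matching. The function names, the tag name, and the exact reward formula are all hypothetical choices for illustration; only the group-relative normalisation step reflects the general GRPO recipe.

```python
import re
from statistics import mean, pstdev


def ci_reward(completion, allowed, disallowed):
    """Hypothetical rule-based CI reward (an assumption, not the paper's formula).

    Rewards use of appropriate attributes and penalises leaked ones.
    """
    # Format rule (assumed): the final output must sit inside <answer>...</answer>.
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return -1.0  # malformed output receives the lowest reward

    answer = match.group(1).lower()
    # Substring matching is a simplification chosen for this sketch.
    used = sum(a.lower() in answer for a in allowed)
    leaked = sum(d.lower() in answer for d in disallowed)

    utility = used / len(allowed) if allowed else 0.0
    leakage = leaked / len(disallowed) if disallowed else 0.0
    return utility - leakage


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardise rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three completions sampled for the same (made-up) prompt.
allowed = ["Alice", "2024-05-01"]
disallowed = ["diabetes"]
group = [
    "<think>share name and date only</think><answer>Book for Alice on 2024-05-01</answer>",
    "<answer>Alice has diabetes and needs a table</answer>",
    "Book for Alice on 2024-05-01",  # missing the assumed <answer> tag
]
rewards = [ci_reward(c, allowed, disallowed) for c in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the completion that uses both appropriate attributes and leaks nothing receives the highest reward and therefore the only positive advantage, which is the signal a GRPO-style update would reinforce.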

Paper Structure

This paper contains 60 sections, 3 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Contextual integrity (CI) violations in agents arise when they fail to recognize the appropriateness of the sharing of background information for a given context. We propose a framework that explicitly reasons about the contextual appropriateness of each user attribute. In this context, the attributes in green are appropriate to share whereas the attributes in red are inappropriate. In this illustration, the agent correctly uses only the appropriate attributes for completing the task.
  • Figure 2: Prompt template for contextual integrity reasoning.
  • Figure 3: Three‑stage synthetic dataset curation pipeline used in Section \ref{sec:dataset}.
  • Figure 4: An example when running Qwen2.5-7B-IT + CI-RL on PrivacyLens.
  • Figure 5: Another example when running Qwen2.5-7B-IT + CI-RL on PrivacyLens.