AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows

Ivoline C. Ngong, Keerthiram Murugesan, Swanand Kadhe, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy

TL;DR

The authors argue that every boundary in an agentic pipeline is a potential site of privacy violation and must be assessed independently. They introduce a Contextual Integrity (CI)-grounded framework that decomposes agentic execution into a sequence of information flows, each annotated with the five CI parameters, and traces violations to their point of origin.

Abstract

Agentic systems are increasingly acting on users' behalf, accessing calendars, email, and personal files to complete everyday tasks. Privacy evaluation for these systems has focused on the input and output boundaries, but each task involves several intermediate information flows, from agent queries to tool responses, that are not currently evaluated. We argue that every boundary in an agentic pipeline is a site of potential privacy violation and must be assessed independently. To support this, we introduce the Privacy Flow Graph, a Contextual Integrity (CI)-grounded framework that decomposes agentic execution into a sequence of information flows, each annotated with the five CI parameters, and traces violations to their point of origin. We present AgentSCOPE, a benchmark of 62 multi-tool scenarios across eight regulatory domains with ground truth at every pipeline stage. Our evaluation across seven state-of-the-art LLMs shows that privacy violations in the pipeline occur in over 80% of scenarios, even when final outputs appear clean (24%), with most violations arising at the tool-response stage, where APIs return sensitive data indiscriminately. These results indicate that output-level evaluation alone substantially underestimates the privacy risk of agentic systems.
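
The idea of decomposing an execution into annotated flows and tracing a violation to its origin can be illustrated with a minimal sketch. This is not the paper's implementation: the `Flow` class, the `appropriate` label, the helper functions, and the example scenario are all hypothetical. The five fields follow the standard Contextual Integrity parameters (sender, recipient, subject, information type, transmission principle), and the stage names are taken from the Figure 3 caption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Flow:
    """One information flow, annotated with the five CI parameters.

    Hypothetical structure for illustration only.
    """
    stage: str                   # pipeline boundary: instruction, query, response, output
    sender: str
    recipient: str
    subject: str                 # whom the information is about
    info_type: str               # e.g. "meeting_time", "diagnosis"
    transmission_principle: str  # norm under which the flow takes place
    appropriate: bool            # assumed ground-truth label for this flow


def violation_origin(flows: List[Flow]) -> Optional[Flow]:
    """Return the earliest inappropriate flow in execution order, or None."""
    for flow in flows:
        if not flow.appropriate:
            return flow
    return None


def output_is_clean(flows: List[Flow]) -> bool:
    """True when no flow at the output stage is labeled inappropriate."""
    return all(f.appropriate for f in flows if f.stage == "output")


# Invented scenario: the tool response over-shares, but the final output does not.
flows = [
    Flow("instruction", "user", "agent", "user", "meeting_time", "task delegation", True),
    Flow("query", "agent", "calendar_api", "user", "meeting_time", "task delegation", True),
    Flow("response", "calendar_api", "agent", "user", "diagnosis", "task delegation", False),
    Flow("output", "agent", "colleague", "user", "meeting_time", "scheduling", True),
]

origin = violation_origin(flows)
print(origin.stage if origin else "no violation")  # response
print(output_is_clean(flows))                      # True
```

In this toy scenario an output-only check would report nothing, while the per-flow check locates the violation at the tool-response stage, which is the kind of gap the abstract describes.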

Paper Structure (6 sections, 3 figures)

This paper contains 6 sections, 3 figures.

Figures (3)

  • Figure 1: Privacy risks across agentic execution boundaries.
  • Figure 2: Sample Privacy Flow Graph (PFG) for a benchmark scenario.
  • Figure 3: (Top) Core privacy and utility metrics on AgentSCOPE: performance of state-of-the-art agentic models from OpenAI (GPT-4o family, GPT-4.1, GPT-5) and Anthropic (Claude Haiku, Claude Opus-4.5, Claude Sonnet-4.5), evaluated using the Privacy Flow Graph (PFG) framework on the AgentSCOPE benchmark. (Middle) Output-only leakage vs. full-pipeline violations: comparison of Leak Rate (LR) and Pipeline Violation Rate (PVR) for keyword-based baselines versus full LLM-driven agentic workflows. (Bottom) Distribution of violations across execution stages: breakdown of privacy violations by stage, such as instruction, query, response, and output, across evaluated models on AgentSCOPE.
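
The contrast between Leak Rate and Pipeline Violation Rate in Figure 3 can be made concrete with a small sketch. The definitions here are assumptions, since this page does not state them: LR is taken to be the fraction of scenarios with a violation at the output stage, and PVR the fraction with a violation at any stage. The paper's exact definitions may differ, and the scenario data below is invented for illustration.

```python
from typing import Dict, List

# Each scenario maps a stage name to whether a violation was observed there.
# Stage names follow the Figure 3 caption; the data is made up.
Scenario = Dict[str, bool]


def leak_rate(scenarios: List[Scenario]) -> float:
    """Assumed LR: share of scenarios with a violation in the final output."""
    return sum(s["output"] for s in scenarios) / len(scenarios)


def pipeline_violation_rate(scenarios: List[Scenario]) -> float:
    """Assumed PVR: share of scenarios with a violation at any stage."""
    return sum(any(s.values()) for s in scenarios) / len(scenarios)


def violations_by_stage(scenarios: List[Scenario]) -> Dict[str, int]:
    """Count, per stage, how many scenarios show a violation there."""
    counts: Dict[str, int] = {}
    for s in scenarios:
        for stage, violated in s.items():
            counts[stage] = counts.get(stage, 0) + int(violated)
    return counts


scenarios = [
    {"instruction": False, "query": False, "response": True, "output": False},
    {"instruction": False, "query": True, "response": True, "output": True},
    {"instruction": False, "query": False, "response": False, "output": False},
    {"instruction": False, "query": False, "response": True, "output": False},
]

print(leak_rate(scenarios))                # 0.25
print(pipeline_violation_rate(scenarios))  # 0.75
print(violations_by_stage(scenarios))
```

Under these assumed definitions PVR can never be lower than LR, because an output-stage violation also counts as a pipeline violation.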