Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

TL;DR

The paper defines Tools Orchestration Privacy Risk (TOP-R) as a pervasive privacy vulnerability in single-agent, multi-tool LLM agents, caused by a misaligned objective function that overemphasizes helpfulness. It introduces TOP-Bench, a three-stage, regulation-grounded dataset with paired Leakage/Benign scenarios and Counterfactual Cues, and proposes the RLR, FIR, and H-Score metrics to quantify privacy leakage and reasoning robustness. Empirical results across eight models reveal extreme leakage (average RLR ~90%) and partial resilience (average FIR ~33%), highlighting an Intelligence-Privacy Paradox in which stronger reasoning correlates with greater leakage. A principle-based mitigation, the Privacy Enhancement Principle (PEP), reduces leakage and improves holistic alignment but cannot fully fix core reasoning limitations, underscoring the need for hard architectural defenses and training objectives that explicitly enforce privacy constraints.

Abstract

Driven by Large Language Models, the single-agent, multi-tool architecture has become a popular paradigm for autonomous agents due to its simplicity and effectiveness. However, this architecture also introduces a new and severe privacy risk, which we term Tools Orchestration Privacy Risk (TOP-R), where an agent, to achieve a benign user goal, autonomously aggregates information fragments across multiple tools and leverages its reasoning capabilities to synthesize unexpected sensitive information. We provide the first systematic study of this risk. First, we establish a formal framework, attributing the risk's root cause to the agent's misaligned objective function: an overoptimization for helpfulness while neglecting privacy awareness. Second, we construct TOP-Bench, comprising paired leakage and benign scenarios, to comprehensively evaluate this risk. To quantify the trade-off between safety and robustness, we introduce the H-Score as a holistic metric. The evaluation results reveal that TOP-R is a severe risk: the average Risk Leakage Rate (RLR) of eight representative models reaches 90.24%, while the average H-Score is merely 0.167, with no model exceeding 0.3. Finally, we propose the Privacy Enhancement Principle (PEP) method, which effectively mitigates TOP-R, reducing the Risk Leakage Rate to 46.58% and significantly improving the H-Score to 0.624. Our work reveals both a new class of risk and inherent structural limitations in current agent architectures, while also offering feasible mitigation strategies.
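
To make the paired-scenario evaluation concrete, the following is a minimal sketch of how a leakage rate could be tallied over leakage scenarios, alongside a block rate over benign ones. It is not the paper's implementation: the `ScenarioResult` record, its fields, and both function names are hypothetical, and the exact definitions of RLR, FIR, and H-Score are not given in this summary, so the sketch only illustrates the general idea of scoring each half of a pair separately.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScenarioResult:
    """Hypothetical record of one evaluated scenario (not from the paper)."""
    kind: str                # "leakage" or "benign", mirroring the paired design
    disclosed: bool = False  # leakage scenario: sensitive inference appeared in the answer
    completed: bool = True   # benign scenario: the agent finished the task


def _rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty collection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def leakage_rate(results: List[ScenarioResult]) -> float:
    """Assumed reading of a leakage rate: share of leakage scenarios that leaked."""
    return _rate(r.disclosed for r in results if r.kind == "leakage")


def benign_block_rate(results: List[ScenarioResult]) -> float:
    """Assumed companion measure: share of benign scenarios the agent failed to complete."""
    return _rate(not r.completed for r in results if r.kind == "benign")


# Toy data, invented purely for illustration.
results = [
    ScenarioResult("leakage", disclosed=True),
    ScenarioResult("leakage", disclosed=True),
    ScenarioResult("leakage", disclosed=True),
    ScenarioResult("leakage", disclosed=False),
    ScenarioResult("benign", completed=True),
    ScenarioResult("benign", completed=True),
    ScenarioResult("benign", completed=True),
    ScenarioResult("benign", completed=False),
]

print(f"leakage rate: {leakage_rate(results):.2f}")            # 0.75
print(f"benign block rate: {benign_block_rate(results):.2f}")  # 0.25
```

Scoring the two halves separately is what lets a holistic metric penalize both failure modes: an agent that refuses everything would show no leakage but would block every benign task.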

Paper Structure

This paper contains 60 sections, 1 equation, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Description of the TOP-R formation process. By aggregating innocuous fragments to infer an undisclosed pregnancy, the agent violates the purpose limitation principle and GDPR Article 9 regarding the processing of special category data. This unauthorized inference creates severe risks, including forced outing and potential social harm, by exposing sensitive health status in a non-medical context (for a detailed explanation of this example, refer to the Appendix).
  • Figure 2: Overview of the dataset construction process.
  • Figure 3: Comprehensive Vulnerability Profile. The radar chart visualizes the Risk Leakage Rate (RLR) across 7 domains. While most models (background thin lines) show widespread risks, the contrast between Qwen3-235B-Thinking (red dash-dotted line) and Qwen3-235B-Instruct (blue dashed line) highlights the "Thinking Tax", where enhanced reasoning expands the risk envelope, specifically in the PII and BPD domains.
  • Figure 4: Risk Leakage Rate (RLR) Before vs. After PEP Mitigation. The bar chart summarizes the effectiveness of the Privacy Enhancement Principle (PEP) across models. Gray bars report baseline RLR, highlighting the widespread leakage observed under the unmitigated setting, while green bars report RLR under PEP, showing substantial reductions. Among all models, Qwen3-235B-Thinking exhibits the largest improvement, with RLR decreasing by 71.23%.