AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems
Faouzi El Yagoubi, Ranwa Al Mallah, Godwin Badu-Marfo
TL;DR
AgentLeak presents what the authors describe as the first full-stack benchmark to quantify privacy leakage across internal and external channels in multi-agent LLM systems. By evaluating 1,000 scenarios across healthcare, finance, legal, and corporate domains with a seven-channel taxonomy and a three-tier detection pipeline, it shows that internal channels, especially inter-agent messages and shared memory, drive the majority of leakage and that output-only audits miss a large fraction of violations (41.7%). The study analyzes five production LLMs, reports a significant security-utility tradeoff, and argues for privacy-aware coordination frameworks with internal-channel protections. The authors release a dataset, detection pipeline, and SDK integrations to enable reproducible benchmarking and guide defenses in regulated deployments. These findings have practical implications for architecture design, compliance, and future benchmark development in multi-agent AI systems.
Abstract
Multi-agent Large Language Model (LLM) systems create privacy risks that current benchmarks cannot measure. When agents coordinate on tasks, sensitive data passes through inter-agent messages, shared memory, and tool arguments, pathways that output-only audits never inspect. We introduce AgentLeak, to the best of our knowledge the first full-stack benchmark for privacy leakage covering internal channels, spanning 1,000 scenarios across healthcare, finance, legal, and corporate domains, paired with a 32-class attack taxonomy and a three-tier detection pipeline. Testing GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B across 4,979 traces reveals that multi-agent configurations reduce output-channel leakage (C1: 27.2% vs 43.2% in single-agent) but introduce unmonitored internal channels that raise total system exposure to 68.9% (OR-aggregated across C1, C2, C5). Internal channels account for most of this gap: inter-agent messages (C2) leak at 68.8%, compared to 27.2% on the output channel (C1). This means that output-only audits miss 41.7% of violations. Claude 3.5 Sonnet, which emphasizes safety alignment in its design, achieves the lowest leakage rates on both external (3.3%) and internal (28.1%) channels, suggesting that model-level safety training may transfer to internal channel protection. Across all five models and four domains, the pattern C2 > C1 holds consistently, confirming that inter-agent communication is the primary vulnerability. These findings underscore the need for coordination frameworks that enforce privacy controls on internal channels, inter-agent communication in particular.
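To make the "OR-aggregated" exposure figure concrete, the sketch below shows one plausible reading of it: a trace counts as exposed if it leaks on any audited channel, and the share missed by an output-only audit is the gap between that rate and the C1 rate. This reading appears consistent with the reported numbers, since 68.9% minus 27.2% gives the 41.7 figure quoted for output-only audits. The `Trace` record, the function names, and the toy data are all hypothetical and are not taken from the AgentLeak release; the abstract also does not define C5, so it is treated here only as a label.

```python
from dataclasses import dataclass

# Channel labels as used in the abstract: C1 = final output, C2 = inter-agent
# messages. C5 is not defined in this section and is treated as an opaque label.
CHANNELS = ("C1", "C2", "C5")


@dataclass(frozen=True)
class Trace:
    """Hypothetical per-trace record: the set of channels on which a leak was detected."""
    leaked: frozenset


def channel_rate(traces, channel):
    """Fraction of traces that leaked on one specific channel."""
    return sum(channel in t.leaked for t in traces) / len(traces)


def or_aggregated_rate(traces, channels=CHANNELS):
    """Fraction of traces that leaked on at least one of the given channels."""
    return sum(any(c in t.leaked for c in channels) for t in traces) / len(traces)


def missed_by_output_only(traces, channels=CHANNELS):
    """Gap between total exposure and what an audit of C1 alone would see."""
    return or_aggregated_rate(traces, channels) - channel_rate(traces, "C1")


# Toy data, invented for illustration only (not benchmark results).
toy = [
    Trace(frozenset({"C1", "C2"})),  # leaks on output and inter-agent messages
    Trace(frozenset({"C2"})),        # internal-only leak, invisible to an output audit
    Trace(frozenset({"C5"})),        # internal-only leak on another channel
    Trace(frozenset()),              # clean trace
]

if __name__ == "__main__":
    print("C1 rate:", channel_rate(toy, "C1"))
    print("OR-aggregated rate:", or_aggregated_rate(toy))
    print("Missed by output-only audit:", missed_by_output_only(toy))
```

On the toy data, two of the three leaking traces never touch C1, which is the situation the abstract describes: an audit restricted to final outputs sees only a subset of the traces that actually exposed data.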
