ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations
Amr Gomaa, Ahmed Salem, Sahar Abdelnabi
TL;DR
ConVerse is a dynamic, multi-domain benchmark for evaluating privacy and security in agent-to-agent conversations, covering 12 user personas and 864 contextually grounded attacks. By simulating multi-turn dialogues between an autonomous assistant and external agents, it shows how abstraction failures and contextual manipulation lead to data leakage and unauthorized actions, even for advanced models. The benchmark provides ground-truth annotations, modular components, and automated evaluation with a fixed judge. Across seven state-of-the-art models, it exposes a persistent trade-off: more capable models achieve higher task utility but leak more private information and show variable security resilience. ConVerse offers a reproducible, extensible framework for diagnosing and mitigating safety vulnerabilities in real-world multi-agent systems, and for driving the development of stronger contextual-integrity and tool-use defenses.
Abstract
As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi-agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent-to-agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single-agent settings, it models autonomous, multi-turn agent-to-agent conversations in which malicious requests are embedded within plausible discourse. Privacy is tested through a three-tier taxonomy that assesses abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state-of-the-art models reveals persistent vulnerabilities: privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. By unifying privacy and security within interactive multi-agent contexts, ConVerse reframes safety as an emergent property of communication.
