CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data

Zhao Cheng; Diane Wan; Matthew Abueg; Sahra Ghalebikesabi; Ren Yi; Eugene Bagdasarian; Borja Balle; Stefan Mellem; Shawn O'Banion

TL;DR

This work addresses privacy risks in AI assistants by introducing CI-Bench, a Contextual Integrity-based benchmark built on a scalable synthetic data pipeline that yields 44k test samples across eight domains. The framework decomposes information-flow mediation into context understanding, norm identification, appropriateness judgment, and response generation, with a case study illustrating its application. Experimental results with Gemini models reveal strong contextual reasoning but notable gaps in appropriateness judgments and in generating privacy-compliant responses, highlighting the importance of explicit norms and model size. CI-Bench offers a granular tool for guiding privacy-aware development, model training, and dataset construction to better align AI assistants with user privacy expectations.

Abstract

Advances in generative AI point towards a new era of personalized applications that perform diverse tasks on behalf of users. While general AI assistants have yet to fully emerge, their potential to share personal data raises significant privacy challenges. This paper introduces CI-Bench, a comprehensive synthetic benchmark for evaluating the ability of AI assistants to protect personal information during model inference. Leveraging the Contextual Integrity framework, our benchmark enables systematic assessment of information flow across important context dimensions, including roles, information types, and transmission principles. Unlike previous work with smaller, narrowly focused evaluations, we present a novel, scalable, multi-step data pipeline that synthetically generates natural communications, including dialogues and emails, which we use to generate 44 thousand test samples across eight domains. Additionally, we formulate and evaluate a naive AI assistant to demonstrate the need for further study and careful training towards personal assistant tasks. We envision CI-Bench as a valuable tool for guiding future language model development, deployment, system design, and dataset construction, ultimately contributing to the development of AI assistants that align with users' privacy expectations.
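To make the context dimensions named in the abstract concrete, the following is a minimal sketch of how a single information flow might be represented and checked against a set of norms. It is an illustration only: the class `InformationFlow`, the function `is_appropriate`, the field names, and the example norms are all hypothetical and are not taken from CI-Bench. In particular, a lookup table stands in for the appropriateness judgment that the benchmark asks a model to make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationFlow:
    """One information flow along Contextual Integrity dimensions (hypothetical schema)."""
    sender_role: str             # who shares the information
    recipient_role: str          # who receives it
    information_type: str        # e.g. "location", "medical_history"
    transmission_principle: str  # condition under which the information is shared


# Hypothetical norms: (recipient role, information type, transmission principle)
# combinations treated as acceptable. Placeholder values, not the paper's norms.
ALLOWED_NORMS = {
    ("doctor", "medical_history", "confidentiality"),
    ("colleague", "work_calendar", "scheduling"),
}


def is_appropriate(flow: InformationFlow, norms=ALLOWED_NORMS) -> bool:
    """Return True when the flow matches one of the allowed norm tuples."""
    key = (flow.recipient_role, flow.information_type, flow.transmission_principle)
    return key in norms


if __name__ == "__main__":
    flow = InformationFlow("user", "doctor", "medical_history", "confidentiality")
    print(is_appropriate(flow))
```

The point of the sketch is that the same information type can be appropriate in one flow and inappropriate in another, depending on the recipient and the transmission principle.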

Paper Structure (24 sections, 10 figures, 4 tables)

Introduction
Background
Related Work
Contributions
Benchmark
Context Understanding
Expectation Identification
Appropriateness Judgment
Response Generation
Illustrating the Framework: A Case Study
Dataset
Experiments
Understanding Context
Identifying Relevant Norms
Judging Appropriateness
...and 9 more sections

Figures (10)

Figure 1: AI assistants act as intermediaries between users and third-party sources. User-provided data can take various forms, such as past email conversations and chat histories, and may include personal and sensitive information (e.g., name, date of birth, email address). User expectations are guided by relevant norms that help determine whether certain user information is appropriate to include in a generated response based on the specific context of an interaction.
Figure 2: Case study of how an AI assistant handles user data and expectations. Given the above context, the AI assistant can judge whether it is appropriate to share the user’s location given the expectations.
Figure 3: Experiment results on context understanding for various model sizes.
Figure 4: Experiment results on identifying relevant norms.
Figure 5: Experiment results on judging appropriateness with models of various sizes, with and without expert-annotated norms tailored to the information attribute(s) present in the scenario.
...and 5 more figures
