Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark

Xinjie Shen, Mufei Li, Pan Li

TL;DR

This work introduces EAPrivacy, a four-tier benchmark to quantify how well large language model–driven embodied agents respect privacy in the physical world. By combining structured PDDL representations with multimodal cues, the benchmark assesses sensitive object identification, adaptation to changing environments, privacy inference under task conflicts, and navigation of social norms versus personal privacy across 400+ scenarios. The experimental results reveal systematic privacy and social-context reasoning gaps across leading models, with a counterintuitive tendency for explicit reasoning to degrade performance, highlighting a critical need for physically grounded privacy alignment. The authors provide extensive datasets and a roadmap for reproducibility, emphasizing the importance of robust privacy-aware behavior in real-world embodied AI systems and proposing future directions for improvements in physical-context privacy understanding.

Abstract

The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy uses procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against critical social norms, leading models like GPT-4o and Claude-3.5-haiku disregarded the social norm over 15% of the time. These findings, demonstrated by our benchmark, underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically aware alignment. Code and datasets will be available at https://github.com/Graph-COM/EAPrivacy.

Paper Structure

This paper contains 46 sections, 1 equation, 6 figures, 9 tables.

Figures (6)

  • Figure 1: An overview of the EAPrivacy benchmark.
  • Figure 2: Tier 1 performance across representative models with varying numbers of distractor items. The x-axis shows the number of items on a log scale. The plots show performance on (a) Main Object Ratio (MOR), (b) Sensitive Objects Identified (N), and (c) Main Object Identified (I). Arrows indicate whether higher (↑) or lower (↓) values are better.
  • Figure 3: Tier 2: (a) Human vs. LLM rating comparison and (b) Model selection patterns.
  • Figure 4: Complete Tier 1 performance across all models with varying numbers of distractor items. The x-axis shows the number of items on a log scale. The plots show performance on (a) Main Object Ratio (MOR), (b) Sensitive Objects Identified (N), and (c) Main Object Identified (I).
  • Figure 5: Complete Tier 2: Model's rating histogram of selected actions in Selection Mode across all evaluated models.
  • ...and 1 more figure