VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering

Youting Wang; Yuan Tang; Yitian Qian; Chen Zhao

Abstract

As Large Vision-Language Models (LVLMs) are increasingly deployed in agent-integrated workflows and other deployment settings, their robustness against semantic visual attacks remains under-evaluated: alignment is typically tested on explicit harmful content rather than on privacy-critical multimodal scenarios. We introduce VisualLeakBench, an evaluation suite that audits LVLMs against OCR Injection and Contextual PII Leakage using 1,000 synthetically generated adversarial images covering 8 PII types, validated on 50 in-the-wild (IRL) screenshots spanning diverse visual contexts. We evaluate four frontier systems (GPT-5.2, Claude 4, Gemini-3 Flash, Grok-4) and report attack success rates (ASR) with Wilson 95% confidence intervals. Claude 4 achieves the lowest OCR ASR (14.2%) but the highest PII ASR (74.4%), exhibiting a comply-then-warn pattern in which verbatim data disclosure precedes any safety-oriented language. Grok-4 achieves the lowest PII ASR (20.4%). A defensive system prompt eliminates PII leakage for two models and reduces Claude 4's leakage from 74.4% to 2.2%, but has no effect on Gemini-3 Flash on synthetic data. Strikingly, IRL validation reveals that Gemini-3 Flash does respond to mitigation on real-world images (50% to 0%), indicating that mitigation robustness is template-sensitive rather than uniformly absent. We release our dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
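For readers unfamiliar with the Wilson 95% confidence intervals mentioned in the abstract, the following is a minimal sketch of how such an interval can be computed for a success rate. It is not the authors' released code: the function name `wilson_interval` and the example counts are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials

    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts: 50 successful attacks out of 100 attempts.
    lo, hi = wilson_interval(50, 100)
    print(f"ASR = 50.0%, 95% CI = [{lo:.1%}, {hi:.1%}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval keeps a non-zero upper bound when no successes are observed, which matters when a rate of 0% is reported on a small sample.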

Paper Structure (25 sections, 1 equation, 3 figures, 7 tables)

Introduction
Related Work
Safety Alignment and Jailbreaking
Multimodal Vulnerabilities and Prompt Injection
Multimodal Safety Benchmarks
Privacy and PII Leakage
Methodology
Threat Model
Dataset Construction
Evaluation Protocol
Experiments and Results
Main Results
Ablation by PII Type
Mitigation Effectiveness
In-the-Wild (IRL) Validation
...and 10 more sections

Figures (3)

Figure 1: Dataset examples. Top row: synthetic images. (a) OCR injection with harmful instruction on noisy background. (b) PII leakage with SSN on a sticky note. Bottom row: in-the-wild (IRL) screenshots. (c) Harmful query embedded in a spreadsheet. (d) Phone numbers and email in an iMessage conversation. PII values shown are synthetic or already publicly available.
Figure 2: Cross-model safety comparison ($N{=}1{,}000$). Error bars show 95% Wilson confidence intervals.
Figure 3: IRL validation results. Left: Baseline ASR on 50 real-world screenshots. Right: PII mitigation effectiveness.
