Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

Jie Zhang; Xiangkui Cao; Zhouyu Han; Shiguang Shan; Xilin Chen

Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

Jie Zhang, Xiangkui Cao, Zhouyu Han, Shiguang Shan, Xilin Chen

TL;DR

Multi-PA introduces a comprehensive, two-dimensional benchmark for privacy in Large Vision-Language Models, combining Privacy Awareness and Privacy Leakage with a large, category-rich dataset across Personal Privacy, Trade Secrets, and State Secrets ($59$ categories, $31{,}962$ samples). It employs a VQA-based pipeline and introduces the EtA metric to balance refusal of sensitive queries with responsiveness to benign ones, evaluating $21$ open-source and $2$ closed-source LVLMs. The findings show persistent privacy leakage across models and tasks, with GPT-4o leading in awareness and phi-3-vision excelling in leakage, but with notable misalignments between awareness and leakage. The work provides a structured framework to diagnose privacy vulnerabilities and informs future development of privacy-preserving LVLMs with practical implications for safer multimodal AI systems.

Abstract

Large Vision-Language Models (LVLMs) exhibit impressive potential across various tasks but also face significant privacy risks, limiting their practical applications. Current researches on privacy assessment for LVLMs is limited in scope, with gaps in both assessment dimensions and privacy categories. To bridge this gap, we propose Multi-PA, a comprehensive benchmark for evaluating the privacy preservation capabilities of LVLMs in terms of privacy awareness and leakage. Privacy awareness measures the model's ability to recognize the privacy sensitivity of input data, while privacy leakage assesses the risk of the model unintentionally disclosing privacy information in its output. We design a range of sub-tasks to thoroughly evaluate the model's privacy protection offered by LVLMs. Multi-PA covers 26 categories of personal privacy, 15 categories of trade secrets, and 18 categories of state secrets, totaling 31,962 samples. Based on Multi-PA, we evaluate the privacy preservation capabilities of 21 open-source and 2 closed-source LVLMs. Our results reveal that current LVLMs generally pose a high risk of facilitating privacy breaches, with vulnerabilities varying across personal privacy, trade secret, and state secret.

Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

TL;DR

categories,

samples). It employs a VQA-based pipeline and introduces the EtA metric to balance refusal of sensitive queries with responsiveness to benign ones, evaluating

open-source and

closed-source LVLMs. The findings show persistent privacy leakage across models and tasks, with GPT-4o leading in awareness and phi-3-vision excelling in leakage, but with notable misalignments between awareness and leakage. The work provides a structured framework to diagnose privacy vulnerabilities and informs future development of privacy-preserving LVLMs with practical implications for safer multimodal AI systems.

Abstract

Paper Structure (40 sections, 4 equations, 7 figures, 10 tables)

This paper contains 40 sections, 4 equations, 7 figures, 10 tables.

Introduction
Related work
Large Vision-Language Models
Privacy Evaluation of Language Models
Task Definition
Definition of Privacy
Evaluation Objective
Multi-PA Benchmark
Dataset Overview
Dataset Construction
Metrics
Experiments
Evaluation Setup
Overall results
Models fail to classify the sensitivity of input questions
...and 25 more sections

Figures (7)

Figure 1: The fundamental capabilities of VLMs are susceptible to misuse, and their application without regard for ethical and legal constraints poses significant privacy risks.
Figure 2: Privacy evaluation framework. For security reasons, we obscure the private parts in images. The framework consists of two key components: Privacy Awareness and Privacy Leakage. Privacy Awareness assesses the model's ability to identify the sensitivity of input data, including the privacy risks associated with images, requests, and the flow of private information in various scenarios. Privacy Leakage focuses on evaluating privacy risks in the model's outputs, classifying potential leakage into three categories: (1) extraction of private information from images, (2) inference of privacy from images, and (3) leakage of sensitive data originating from training data.
Figure 3: VQA Generation Process. For security reasons, we obscure the private parts in images. We build Image Database and Attribute Database by collecting images and designing attributes for each privacy category. For each task, we create a variety of question templates which will be randomly selected to generate samples. Each VQA sample is the combination of an image from Image Database and a question from Question Templates. For Privacy Question Detection and Privacy InfoFlow Assessment, context of each sample is respectively from corresponding question in Privacy Leakage and sample in CONFAIDE mireshghallah2023can.
Figure 4: Results on Privacy Leakage. The metric of Insensitive Questions is $1 - RtA$ and other tasks are measured by $RtA$. Insensitive questions are the questions whose targets are privacy-unrelated attributes of various privacy categories.
Figure 5: Detailed results of Privacy Leakage. The metric of Perception Leakage, Reasoning Leakage and Memory Leakage is $RtA$ and the metric of Insensitive Questions is $1 - RtA$. "Average" denotes the mean results across all models. PP: Personal Privacy; TS: Trade Secret; SS: State Secret; PL: Perception Leakage; RL: Reasoning Leakage; ML: Memory Leakage; IQ: Insensitive Questions.
...and 2 more figures

Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

TL;DR

Abstract

Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

Authors

TL;DR

Abstract

Table of Contents

Figures (7)