Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?

Ruixin Yang, Ethan Mendes, Arthur Wang, James Hays, Sauvik Das, Wei Xu, Alan Ritter

TL;DR

Vision-language models enable precise image geolocation, raising privacy concerns when casual sharing leads to sensitive location disclosure. The authors introduce VLM-GeoPrivacy, a benchmark that extends contextual integrity to multimodal geolocation with 1,200 real-world images and two evaluation tasks, enabling measurement of context-aware disclosure. Across 14 models, results show pervasive misalignment with human privacy expectations, with frequent over-disclosure and vulnerability to adversarial prompting; few-shot contextual cues can help but do not achieve a satisfactory privacy-utility balance. The work argues for new design principles and context-conditioned privacy reasoning to align multimodal systems with social norms and privacy protections.

Abstract

Vision-language models (VLMs) have demonstrated strong performance in image geolocation, a capability further sharpened by frontier multimodal large reasoning models (MLRMs). This poses a significant privacy risk, as these widely accessible models can be exploited to infer sensitive locations from casually shared photos, often at street-level precision, potentially surpassing the level of detail the sharer consented or intended to disclose. While recent work has proposed applying a blanket restriction on geolocation disclosure to combat this risk, these measures fail to distinguish valid geolocation uses from malicious behavior. Instead, VLMs should maintain contextual integrity by reasoning about elements within an image to determine the appropriate level of information disclosure, balancing privacy and utility. To evaluate how well models respect contextual integrity, we introduce VLM-GEOPRIVACY, a benchmark that challenges VLMs to interpret latent social norms and contextual cues in real-world images and determine the appropriate level of location disclosure. Our evaluation of 14 leading VLMs shows that, despite their ability to precisely geolocate images, the models are poorly aligned with human privacy expectations. They often over-disclose in sensitive contexts and are vulnerable to prompt-based attacks. Our results call for new design principles in multimodal systems to incorporate context-conditioned privacy reasoning.

Paper Structure

This paper contains 39 sections, 4 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Vision Language Models (VLMs) often do not reflect human expectations of information disclosure about images provided in context. For instance, they may limit their helpfulness by under-disclosing location information, such as by failing to provide the exact location of an image of a distinctive landmark (top). Alternatively, VLMs may compromise user privacy by over-disclosing location information, such as by providing the exact location of a political protest (bottom). We introduce VLM-GeoPrivacy to evaluate the incidence of over- and under-disclosure of location information by VLMs. Faces are blurred for presentation to protect privacy.
  • Figure 2: Distribution of images across privacy-sensitive categories.
  • Figure 3: Privacy-utility tradeoff under the three free-form prompting settings. We define two aggregated metrics for privacy preservation and utility: the privacy-preserving score is the complement of the average of the contextualized location exposure rate, abstention violation rate, and over-disclosure rate, while the utility score aggregates the geolocation accuracy at the street, city, and region levels ($A_{1}$, $A_{25}$, and $A_{200}$) by taking the normalized area under the linear interpolation between $(1, A_{1})$, $(25, A_{25})$, and $(200, A_{200})$. Detailed definitions are given in the appendix (additional results). Compared with the vanilla setting, models with iterative CoT or malicious prompting generally shift toward smaller radii closer to the origin, reflecting a worse privacy-utility tradeoff. No model achieves a satisfying tradeoff, as none attains strong privacy preservation and utility at the same time.
  • Figure 4: Including relevant one-shot and few-shot examples with contextual cues improves granularity accuracy (left) and decreases over-disclosure (right) compared with vanilla zero-shot prompting.
  • Figure 5: How sensitive factors (face visibility and location sharing intent) influence the models' decisions and their alignment with human judgments about the appropriate level of location disclosure granularity. Subfigures (a) and (b) show the distributions of responses from both humans and models, while subfigures (c) and (d) show the models' over- and under-disclosure rates. The models in subfigures (a) and (b) are sorted according to the increase in the percentage of responses that are abstentions or at a coarse level (indicated by the portion of each vertical bar) from the low-sensitivity case (left) to the high-sensitivity case (right). The models in subfigures (c) and (d) are sorted in descending order by the over-disclosure rate in high-sensitivity cases (right, indicated by the bars). Compared with humans, models show only a modest increase in the rate of abstention or coarse granularity as the overall sensitivity increases.
  • ...and 11 more figures
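
The two aggregated metrics described in the Figure 3 caption can be sketched as follows. This is only an illustrative reading of the caption, not the authors' implementation: the function names are hypothetical, all rates and accuracies are assumed to be fractions in [0, 1], and the "normalized area" is assumed to mean the trapezoidal area on a linear radius axis divided by the radius span (200 - 1), so that a model with perfect accuracy at every radius scores 1. The exact definitions are in the paper's appendix and may differ.

```python
def privacy_preserving_score(exposure_rate, abstention_violation_rate, over_disclosure_rate):
    """Complement of the mean of the three violation rates (each assumed in [0, 1])."""
    rates = (exposure_rate, abstention_violation_rate, over_disclosure_rate)
    return 1.0 - sum(rates) / len(rates)


def utility_score(a1, a25, a200, radii=(1.0, 25.0, 200.0)):
    """Normalized area under the piecewise-linear curve through
    (1, A_1), (25, A_25), (200, A_200).

    Assumption: normalization divides by the radius span, which is the
    area obtained when accuracy is 1 at every radius.
    """
    accuracies = (a1, a25, a200)
    area = 0.0
    for i in range(len(radii) - 1):
        width = radii[i + 1] - radii[i]
        # Trapezoid between two adjacent (radius, accuracy) points.
        area += 0.5 * (accuracies[i] + accuracies[i + 1]) * width
    return area / (radii[-1] - radii[0])


if __name__ == "__main__":
    # Made-up numbers for illustration only; not results from the paper.
    print(privacy_preserving_score(0.3, 0.6, 0.3))
    print(utility_score(0.2, 0.6, 0.8))
```

Under these assumptions both scores lie in [0, 1], so a model can be placed as a single point on a privacy-versus-utility plot, with points nearer the origin indicating a worse tradeoff.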