Granular Privacy Control for Geolocation with Vision Language Models

Ethan Mendes, Yang Chen, James Hays, Sauvik Das, Wei Xu, Alan Ritter

TL;DR

This paper introduces GPTGeoChat, a new benchmark for testing how well VLMs can moderate geolocation dialogues with users. Custom fine-tuned models perform on par with prompted API-based models when identifying leaked location information at the country or city level.

Abstract

Vision Language Models (VLMs) are rapidly advancing in their capability to answer information-seeking questions. As these models are widely deployed in consumer applications, they could lead to new privacy risks due to emergent abilities to identify people in photos, geolocate images, etc. As we demonstrate, somewhat surprisingly, current open-source and proprietary VLMs are very capable image geolocators, making widespread geolocation with VLMs an immediate privacy risk, rather than merely a theoretical future concern. As a first step to address this challenge, we develop a new benchmark, GPTGeoChat, to test the ability of VLMs to moderate geolocation dialogues with users. We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v, which are annotated with the granularity of location information revealed at each turn. Using this new dataset, we evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed. We find that custom fine-tuned models perform on par with prompted API-based models when identifying leaked location information at the country or city level; however, fine-tuning on supervised data appears to be needed to accurately moderate finer granularities, such as the name of a restaurant or building.
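
The moderation decision described above can be illustrated with a minimal sketch. It assumes that each turn carries a label for the finest location granularity revealed so far and that an agent configured at a given level must flag anything at that level or finer. The `Granularity` names, their ordering, and the helpers `should_flag` and `moderate` are hypothetical and are not taken from the GPTGeoChat release; the source names only the country and city levels and finer ones such as a restaurant or building name.

```python
from enum import IntEnum
from typing import List, Optional


class Granularity(IntEnum):
    # Ordered from coarse to fine. These five names are illustrative
    # assumptions, not the benchmark's official label set.
    COUNTRY = 0
    CITY = 1
    NEIGHBORHOOD = 2
    EXACT_LOCATION_NAME = 3
    EXACT_COORDINATES = 4


def should_flag(revealed: Optional[Granularity], configured: Granularity) -> bool:
    """Return True if a message reveals location information at the
    configured granularity or finer (None means nothing was revealed)."""
    if revealed is None:
        return False
    return revealed >= configured


def moderate(turn_labels: List[Optional[Granularity]],
             configured: Granularity) -> List[bool]:
    """Apply the flagging rule to every turn of a dialogue."""
    return [should_flag(label, configured) for label in turn_labels]


if __name__ == "__main__":
    dialogue = [None, Granularity.COUNTRY, Granularity.CITY,
                Granularity.EXACT_LOCATION_NAME]
    # A city-level configuration allows only the country to be revealed.
    print(moderate(dialogue, Granularity.CITY))
```

Under this assumed rule, a city-level configuration lets a country-only message through and flags any message that names the city or something more specific.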

Paper Structure

This paper contains 45 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The GPTGeoChat benchmark (see the dataset section) consists of dialogues between a human and GPT-4v [achiam2023gpt] for the task of image geolocation at five location granularities. After each dialogue turn, human annotators update the location information revealed by GPT-4v. The benchmark is designed to assess the ability of multimodal moderation agents to offer granular protection of sensitive location information. Given the image and a truncated version of the dialogue, agents flag messages that reveal sensitive location information according to the configuration set by the admin or image owner. The example agent is configured at the city level, meaning that only the country may be revealed.
  • Figure 2: GPT-4v with geographical least-to-most (LTM) prompting performs well on the IM2GPS benchmark [hays2008im2gps] compared to the state-of-the-art geolocation models GeoDecoder [clark2023geodecoder], GeoCLIP [vivanco2024geoclip], and PIGEOTTO [haas2023pigeon]. GPT-4v also has the lowest median distance error, at $13$ km.
  • Figure 3: Message-level moderation F1 scores for baselines, prompted base models, and fine-tuned moderation agents across granularities. Standard errors were calculated using the bootstrap method [wasserman2019bootstrap].
  • Figure 4: Privacy-utility tradeoff between leaked and wrongly withheld location information for the middle three granularities. Agents closer to the origin are better. Agents in the blue region favor privacy over utility, and those in the pink region favor utility over privacy.
  • Figure 5: Cumulative distribution function (CDF) of geocoding prediction error for agents configured at the city level, i.e., agents that should ideally disclose only the country. Agents whose CDFs rise slowly are preferable, since a slow rise indicates that few images could be geolocated precisely when location information from the moderated conversations is passed to the geocoding API. The moderated dialogues from the best-performing prompted agent (GPT-4v) still allow $3\%$ of images to be geolocated within $20$ km. See the full version of this figure (fig:api-prediction-error-full) for results on all agents.
  • ...and 4 more figures
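
Two of the quantities named in the captions above, the bootstrap standard error of an F1 score (Figure 3) and the value of a prediction-error CDF at a distance threshold (Figure 5), can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' evaluation code; the function names, the resample count, and the toy inputs are assumptions.

```python
import random
from typing import List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1, where 1 means the message should be (or was) flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_se(y_true: Sequence[int], y_pred: Sequence[int],
                 n_resamples: int = 1000, seed: int = 0) -> float:
    """Standard error of F1, estimated by resampling messages with
    replacement (n_resamples and seed are illustrative defaults)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return var ** 0.5


def empirical_cdf_at(errors_km: Sequence[float], threshold_km: float) -> float:
    """Fraction of images whose prediction error is at most threshold_km."""
    if not errors_km:
        return 0.0
    return sum(1 for e in errors_km if e <= threshold_km) / len(errors_km)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # toy labels, not benchmark data
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
    print("F1:", f1_score(y_true, y_pred))
    print("bootstrap SE:", bootstrap_se(y_true, y_pred))
    print("share within 20 km:", empirical_cdf_at([5.0, 15.0, 30.0, 400.0], 20.0))
```

Evaluating `empirical_cdf_at` at 20 km gives the kind of figure quoted in the Figure 5 caption: the share of images that can still be geolocated within that distance after moderation.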