Zero-shot image privacy classification with Vision-Language Models

Alina Elena Baia, Alessio Xompero, Andrea Cavallaro

TL;DR

The paper tackles zero-shot image privacy classification by evaluating open-source Vision-Language Models against task-specific baselines using a fair zero-shot benchmark on PrivacyAlert and IPD. It designs two prompts to probe VLMs and finds that, despite the large compute requirements, VLMs generally underperform in classification accuracy relative to specialized models, though they exhibit greater robustness to perturbations. The results underscore the current performance ceiling of VLMs for privacy tasks and provide a baseline for future work, highlighting directions such as smaller architectures and targeted prompting or fine-tuning. Overall, the work guides practitioners on the trade-offs between robustness and accuracy when choosing between generic VLMs and task-tailored privacy classifiers.

Abstract

While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
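In a zero-shot setup of this kind, the VLM returns free text, so each answer has to be mapped to a binary label before it can be scored. The following is a minimal sketch, assuming a prompt that asks for a [Yes] or [No] answer; the function name `parse_privacy_answer` and the rule of taking the first Yes/No token are illustrative assumptions, not the authors' procedure.

```python
import re


def parse_privacy_answer(answer):
    """Map a free-text VLM answer to a label.

    Returns 1 (private) for "yes", 0 (public) for "no", and None when
    neither word appears as a standalone token. Taking the first match
    is an assumption made for this sketch.
    """
    match = re.search(r"\b(yes|no)\b", answer.lower())
    if match is None:
        return None
    return 1 if match.group(1) == "yes" else 0
```

Answers that map to `None` need an explicit policy (for example, counting them as errors or as a default class); which policy is appropriate depends on the evaluation protocol and is not specified here.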

Paper Structure

This paper contains 4 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Zero-shot image privacy classification with a pre-trained VLM. Answers from selected open-source VLMs for a given image, taken from the public benchmark PrivacyAlert Zhao2022ICWSM_PrivacyAlert.
  • Figure 2: Sample answers by three open-source VLMs prompted with "Is this image likely to contain private information? Answer [Yes] or [No]." on sample images from PrivacyAlert Zhao2022ICWSM_PrivacyAlert.
  • Figure 3: Robustness of LLaVa, Phi-3-V, and S2P to perturbations applied to the images of the testing set of PrivacyAlert Zhao2022ICWSM_PrivacyAlert. First column: lossy JPEG compression, varying the quality parameter used when encoding the images. Second column: illumination changes, varying the brightness (gamma value) of the images. Note the logarithmic scale of the x-axis. The dashed line marks the gamma value of the original image, unaffected by any brightness perturbation ($\gamma=1$). Third column: salt pseudo-random noise added to the input image, preserving only noise intensity values higher than a varying threshold. Fourth column: zero-mean Gaussian pseudo-random noise, varying the standard deviation of the generated noise (Gaussian std). For each generated noise, S2P is evaluated over 10 inference runs.
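
Three of the perturbations described for Figure 3 can be sketched with NumPy alone. This is a hedged illustration, not the authors' implementation: the function names, the 8-bit image range, the gamma convention (output = input raised to the power gamma, on a [0, 1] scale), and the way the salt noise is scaled and added are all assumptions. JPEG compression is left out because it requires an image codec.

```python
import numpy as np


def adjust_gamma(image, gamma):
    """Power-law brightness change; gamma = 1 leaves the image unchanged.

    Assumed convention: with this formula, gamma > 1 darkens the image.
    """
    normalized = image.astype(np.float64) / 255.0
    return np.round(255.0 * normalized ** gamma).astype(np.uint8)


def add_salt_noise(image, threshold, rng):
    """Add uniform noise, keeping only noise values above `threshold`.

    Assumption: noise is drawn in [0, 1), scaled to the 8-bit range,
    and added to the image; a higher threshold keeps fewer noisy pixels.
    """
    noise = rng.uniform(0.0, 1.0, size=image.shape)
    kept = np.where(noise > threshold, noise, 0.0)
    noisy = image.astype(np.float64) + 255.0 * kept
    return np.clip(np.round(noisy), 0, 255).astype(np.uint8)


def add_gaussian_noise(image, std, rng):
    """Add zero-mean Gaussian noise with standard deviation `std`."""
    noise = rng.normal(0.0, std, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(np.round(noisy), 0, 255).astype(np.uint8)
```

Passing an explicit generator such as `np.random.default_rng(seed)` keeps each noise realization reproducible, which matters when a model is evaluated over repeated inference runs on the same perturbed input.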