Zero-shot image privacy classification with Vision-Language Models
Alina Elena Baia, Alessio Xompero, Andrea Cavallaro
TL;DR
The paper tackles zero-shot image privacy classification by evaluating open-source Vision-Language Models (VLMs) against task-specific baselines under a fair zero-shot benchmark on PrivacyAlert and IPD. It designs two prompts to probe the VLMs and finds that, despite their large compute requirements, VLMs generally achieve lower classification accuracy than specialized models, though they are more robust to image perturbations. The results show where VLMs currently stand relative to the performance ceiling set by purpose-built models and provide a baseline for future work, pointing to directions such as smaller architectures and targeted prompting or fine-tuning. Overall, the work guides practitioners on the trade-off between robustness and accuracy when choosing between generic VLMs and task-tailored privacy classifiers.
Abstract
While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. In the absence of systematic evaluation, this trend risks overlooking the performance ceiling set by purpose-built models. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the three open-source VLMs ranked highest on a privacy benchmark, using task-aligned prompts, and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite being resource-intensive (high parameter count and slower inference), currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
