Which private attributes do VLMs agree on and predict well?

Olena Hrynenko, Darya Baranouskaya, Alina Elena Baia, Andrea Cavallaro

TL;DR

The paper addresses zero-shot privacy attribute recognition in images using open-source Visual Language Models (VLMs) and compares their outputs to human VISPR annotations. It employs three instruction-following VLMs to label 67 privacy attributes across 8,000 VISPR test images, following the human annotation prompts and using a parsing pipeline to convert model responses into present/absent labels. The results show strong recall and balanced accuracy above 0.75, with Qwen2.5-VL-7B-Instruct generally aligning best with human labels, while the other models lag. Importantly, VLMs can complement human annotation by catching attributes that humans sometimes miss, though they may also mislabel non-human content. This indicates that VLMs could augment large-scale privacy labeling, provided their outputs are carefully monitored.

Abstract

Visual Language Models (VLMs) are often used for zero-shot detection of visual attributes in images. We present a zero-shot evaluation of open-source VLMs for privacy-related attribute recognition. We identify the attributes for which VLMs exhibit strong inter-annotator agreement, and discuss the cases in which human and VLM annotations disagree. Our results show that, when evaluated against human annotations, VLMs tend to predict the presence of privacy attributes more often than human annotators. In addition, we find that in cases of high inter-annotator agreement between VLMs, they can complement human annotation by identifying attributes overlooked by human annotators. This highlights the potential of VLMs to support privacy annotations in large-scale image datasets.

Paper Structure (7 sections, 5 figures, 1 table)

This paper contains 7 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Examples of images from the VISPR dataset (Orekondy et al., 2017). Attributes annotated as present are shown in green, and those annotated as absent are shown in red. While both images show a group of people, the Weight Group and Height Group attributes have been omitted by a human annotator for the image in the first row.
  • Figure 2: Distribution of precision and recall for present and absent labels for zero-shot recognition by Qwen2.5-VL-7B-Instruct, Gemma-3-4b-it, and Llama-3.2-11B-Vision-Instruct across the 67 attributes of the VISPR test set (Orekondy et al., 2017).
  • Figure 3: Balanced accuracy of zero-shot privacy attribute recognition for Qwen2.5-VL-7B-Instruct, Gemma-3-4b-it, and Llama-3.2-11B-Vision-Instruct on the VISPR test set (Orekondy et al., 2017).
  • Figure 4: Disagreements between the human labels and the VLM labels. Many of the disagreements are cases where the VLMs predict the presence of an attribute that the human annotators marked as absent. Integers denote image counts, with proportions shown in parentheses.
  • Figure 5: Left: Percentage of images with at least $N$ other human-defining attributes present according to human annotators, for the Age Group, Gender, and Hair Color attributes. Right: Percentage of images with at least $N$ other relationship-defining attributes present according to human annotators, for the Spectators attribute. The percentage is computed over the cases in which the VLM label is present and the human label is absent (e.g., 523 such cases for the Age Group attribute).
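
The quantities reported in Figures 2 to 4 (precision and recall for the present and absent labels, balanced accuracy, and the count of VLM-present/human-absent disagreements) can be illustrated with a minimal sketch for a single attribute. This is not the authors' evaluation code: the function name `binary_metrics`, the dictionary keys, and the toy labels are hypothetical, and the human labels are treated as the reference, as in the paper's comparison.

```python
def binary_metrics(human, vlm):
    """Compare VLM labels against human labels for one attribute.

    Both arguments are equal-length sequences of booleans
    (True = attribute present, False = attribute absent).
    Human labels are treated as the reference (an assumption of this sketch).
    """
    if len(human) != len(vlm):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(human, vlm))
    tp = sum(h and v for h, v in pairs)              # both say present
    tn = sum((not h) and (not v) for h, v in pairs)  # both say absent
    fp = sum((not h) and v for h, v in pairs)        # VLM present, human absent
    fn = sum(h and (not v) for h, v in pairs)        # VLM absent, human present

    def ratio(num, den):
        # Return NaN when the denominator is empty rather than raising.
        return num / den if den else float("nan")

    recall_present = ratio(tp, tp + fn)
    recall_absent = ratio(tn, tn + fp)
    return {
        "precision_present": ratio(tp, tp + fp),
        "recall_present": recall_present,
        "precision_absent": ratio(tn, tn + fn),
        "recall_absent": recall_absent,
        # Balanced accuracy is the mean of the two per-class recalls.
        "balanced_accuracy": (recall_present + recall_absent) / 2,
        "vlm_present_human_absent": fp,
    }


# Toy labels for six images (hypothetical, not taken from VISPR).
human_labels = [True, True, False, False, False, False]
vlm_labels = [True, False, True, True, False, False]
metrics = binary_metrics(human_labels, vlm_labels)
print(metrics["balanced_accuracy"])         # 0.5
print(metrics["vlm_present_human_absent"])  # 2
```

Balanced accuracy averages the recall of the two classes, so it is not inflated when an attribute is absent from most images, which is the usual situation for rare privacy attributes.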