Perception of Visual Content: Differences Between Humans and Foundation Models

Nardiena A. Pratama, Shaoyang Fan, Gianluca Demartini

TL;DR

The study addresses whether machine-generated annotations can replace human annotations for visually diverse content and how annotation choice affects bias and performance. It combines three annotation streams (ML Objects via Faster R-CNN, ML Captions via BLIP, and Human Labels from MTurk) on the Dollar Street dataset with $384$-dimensional sentence embeddings to tackle region-classification and income-regression tasks. Key findings show that ML Captions align with Human Labels at the lexical level, ML Captions yield the best region-classification performance (overall F1 ≈ 0.41), and income regression is strongest when using ML Objects+ML Captions; action-related categories favor ML-based annotations while non-action categories benefit from human input. The results suggest that human and machine annotations exhibit similar cross-regional biases, and that a diverse, hybrid annotation strategy enhances robustness and fairness, rather than allowing machine annotations to fully replace human labor. The authors provide open data and code to support reproducibility and guide annotation practices in multimodal perception research, with significance for reducing bias in deployed systems while maintaining interpretability and accountability.

Abstract

Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study explores the similarity between human-generated and ML-generated annotations of images across diverse socio-economic contexts (RQ1) and their impact on ML model performance and bias (RQ2). We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering various daily activities and home environments. ML captions and human labels show the highest similarity at a low level, i.e., in the types of words that appear and in sentence structure, but all annotations are consistent in how they perceive images across regions. ML Captions resulted in the best overall region classification performance, while ML Objects and ML Captions combined performed best overall for income regression. ML annotations worked best for action categories, while human input was more effective for non-action categories. These findings highlight that both human and machine annotations are important, and that human-generated annotations are not yet replaceable.

Paper Structure

This paper contains 15 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example images annotated with the pre-trained Faster R-CNN model. Left: Hand Washing. Right: Living Rooms.
  • Figure 2: Screenshot of the MTurk annotation interface showing the task instructions and image assessment layout.
  • Figure 3: t-SNE visualisation of sentence embeddings of each label.
  • Figure 4: Pairwise relationships between similarity scores.
  • Figure 5: Income Regression Model using ML Objects and Captions as Annotations Grouped by Continent. The x-axis represents the ground truth values, plotted in log scale, while the y-axis represents the predicted values, plotted in linear scale.