Perception of Visual Content: Differences Between Humans and Foundation Models

Nardiena A. Pratama; Shaoyang Fan; Gianluca Demartini

TL;DR

The study asks whether machine-generated annotations can replace human annotations for visually diverse content, and how the choice of annotation affects bias and performance. It combines three annotation streams on the Dollar Street dataset (ML Objects via Faster R-CNN, ML Captions via BLIP, and Human Labels from MTurk), encodes them as 384-dimensional sentence embeddings, and uses them for region-classification and income-regression tasks. ML Captions align with Human Labels at the lexical level and yield the best region-classification performance (overall F1 ≈ 0.41), while income regression is strongest when ML Objects and ML Captions are combined. Action-related categories favor ML-based annotations, whereas non-action categories benefit from human input. The results suggest that human and machine annotations exhibit similar cross-regional biases, and that a diverse, hybrid annotation strategy improves robustness and fairness, so machine annotations should complement human labor rather than fully replace it. The authors provide open data and code to support reproducibility and to guide annotation practices in multimodal perception research, with the aim of reducing bias in deployed systems while maintaining interpretability and accountability.

Abstract

Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study explores the similarity between human-generated and ML-generated annotations of images across diverse socio-economic contexts (RQ1) and their impact on ML model performance and bias (RQ2). We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering various daily activities and home environments. ML Captions and Human Labels show the highest similarity at a low level, i.e., in the types of words that appear and in sentence structure, but all annotations are consistent in how they perceive images across regions. ML Captions yielded the best overall region-classification performance, while the combination of ML Objects and ML Captions performed best overall for income regression. ML annotations worked best for action categories, while human input was more effective for non-action categories. These findings highlight that both human and machine annotations are important, and that human-generated annotations cannot yet be replaced.
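
To make "low-level" similarity concrete, the following minimal sketch compares annotations by word overlap. The measures actually used in the paper are not given in this summary, so Jaccard overlap is an assumed stand-in, and the example annotations and the helper names `tokens` and `jaccard` are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between the word sets of two annotations (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty annotations share no words; treat the overlap as zero.
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical annotations for a single image; not taken from the dataset.
ml_caption = "a woman cooking food in a kitchen"
human_label = "woman cooking in kitchen"
ml_objects = "person bowl oven"

caption_vs_human = jaccard(ml_caption, human_label)
objects_vs_human = jaccard(ml_objects, human_label)
```

In this toy case the caption shares most of its vocabulary with the human label, while the object list shares none, which is the kind of word-level contrast the abstract describes.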
