
Evaluating Multiview Object Consistency in Humans and Image Models

Tyler Bonnen, Stephanie Fu, Yutong Bai, Thomas O'Connell, Yoni Friedman, Nancy Kanwisher, Joshua B. Tenenbaum, Alexei A. Efros

TL;DR

This paper introduces a benchmark that directly evaluates the alignment between human observers and vision models on a 3D shape inference task, and finds that humans outperform all models by a wide margin.

Abstract

We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated 'nonsense' objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.
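
To make the task concrete, here is a minimal sketch of how an odd-one-out decision could be read out from a model's image embeddings. It is an illustration under stated assumptions, not the paper's evaluation code: the function names (`cosine_similarity`, `predict_odd_one_out`, `accuracy`) are hypothetical, the embeddings are assumed to have been extracted already (for example, from a frozen vision encoder), and cosine similarity is only one of several possible readouts.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_odd_one_out(embeddings):
    """Return the index of the image least similar to the others.

    For a triplet this is equivalent to finding the most similar pair
    and choosing the remaining image. (Assumed readout, not taken from
    the paper.)
    """
    n = len(embeddings)
    mean_similarity = []
    for i in range(n):
        sims = [
            cosine_similarity(embeddings[i], embeddings[j])
            for j in range(n)
            if j != i
        ]
        mean_similarity.append(sum(sims) / len(sims))
    return min(range(n), key=lambda i: mean_similarity[i])


def accuracy(trials):
    """Fraction of trials where the predicted oddity matches the label.

    Each trial is a pair (embeddings, odd_index).
    """
    correct = sum(
        1 for embeddings, odd_index in trials
        if predict_odd_one_out(embeddings) == odd_index
    )
    return correct / len(trials)


# Toy trial: images 0 and 1 show object A from two viewpoints, image 2 shows B.
toy_trial = ([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], 2)
print(predict_odd_one_out(toy_trial[0]))
print(accuracy([toy_trial]))
```

With three images per trial, chance performance under this readout is 1/3, which is the baseline any model accuracy would be compared against.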

Paper Structure (44 sections, 26 figures, 2 tables)


Figures (26)

  • Figure 1: How well do computer vision models represent the 3D structure of objects? We develop a benchmark using a shape inference task from the cognitive sciences: Multiview Object Consistency in Humans and in Image models (MOCHI). Given three images of objects from random viewpoints, observers must identify which image depicts the object that is different. We compare human performance (35K trials from over 500 subjects, including accuracy, reaction time, and gaze data) against a number of standard computer vision models.
  • Figure 2: Example stimuli from the four datasets in MOCHI. Each trial is a triplet of images containing two objects: one object shown from two different viewpoints (A and A$'$) and a second object (B). Depending on the experiment, participants identify either the matching pair (A and A$'$) or the non-matching object (B). In each example, B is marked with a * for illustrative purposes. Descriptions and examples of all categories in this benchmark can be found in the appendix.
  • Figure 3: Distribution of human accuracy across trials in each dataset in this benchmark. While humans are reliably accurate across trials, there is a long-tailed distribution of performance. This is by design, as it provides more challenging behavioral targets to model.
  • Figure 4: Accuracy and reaction time distributions for human participants across datasets. Across datasets we observe a clear relationship between accuracy and processing time; as trials become more difficult, participants allocate more attention/time. Critically, the distribution of human behavior ranges from chance to ceiling, indicating that we have a suitable estimate of the full range of human visual abilities. This and all subsequent error bars are SEM computed over trials.
  • Figure 5: Examples of human saliency maps collected on a subset of images in MOCHI. Given the foveal nature of primate vision, humans must move their gaze in order to collect high-acuity visual information. As such, measuring human gaze patterns reveals human attention patterns. We collect gaze behavior from human participants on all trials in one stimulus set (barense).
  • ...and 21 more figures
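
The Figure 4 caption states that error bars are SEM computed over trials. As a rough sketch of what that computation looks like, the snippet below takes per-trial accuracies and returns their mean and standard error. The function name `mean_and_sem` is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption; the caption does not specify which estimator was used.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) over a list of trials.

    Assumes the sample standard deviation (n - 1 denominator); this
    choice is not stated in the source.
    """
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two trials")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Toy per-trial accuracies (1 = correct, 0 = incorrect).
trial_accuracy = [1, 0, 1, 1]
mean, sem = mean_and_sem(trial_accuracy)
print(f"accuracy = {mean:.2f} +/- {sem:.2f}")
```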