The 3D-PC: a benchmark for visual perspective taking in humans and machines

Drew Linsley, Peisen Zhou, Alekh Karkada Ashok, Akash Nagaraj, Gaurav Gaonkar, Francis E Lewis, Zygmunt Pizlo, Thomas Serre

TL;DR

The paper asks whether the 3D perception that emerges in DNNs trained on static images suffices for visual perspective taking (VPT). It introduces the 3D-PC benchmark, which uses 3D Gaussian Splatting to generate large, controlled stimulus sets, and evaluates depth order, VPT-basic, and VPT-Strategy in humans and 327 DNNs using linear probing and prompting. DNNs match or exceed human depth-order performance but struggle with VPT-basic; fine-tuning improves VPT-basic, yet the fine-tuned models fail to generalize to VPT-Strategy. ImageNet accuracy correlates with 3D capabilities, suggesting that monocular depth cues arise alongside object recognition but are insufficient for true VPT. The work highlights a gap between machine and human 3D reasoning learned from static data and provides data, models, and tools to push toward more human-like 3D perception in AI.

Abstract

Visual perspective taking (VPT) is the ability to perceive and reason about the perspectives of others. It is an essential feature of human intelligence, which develops over the first decade of life and requires an ability to process the 3D structure of visual scenes. A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets. We investigated whether this emergent ability for 3D analysis in DNNs is sufficient for VPT with the 3D perception challenge (3D-PC): a novel benchmark for 3D perception in humans and DNNs. The 3D-PC comprises three 3D-analysis tasks posed within natural scene images: (1) a simple test of object depth order, (2) a basic VPT task (VPT-basic), and (3) another version of VPT (VPT-Strategy) designed to limit the effectiveness of "shortcut" visual strategies. We tested human participants (N=33) and linearly probed or text-prompted over 300 DNNs on the challenge and found that nearly all of the DNNs approached or exceeded human accuracy in analyzing object depth order. Surprisingly, DNN accuracy on this task correlated with their object recognition performance. In contrast, there was an extraordinary gap between DNNs and humans on VPT-basic. Humans were nearly perfect, whereas most DNNs were near chance. Fine-tuning DNNs on VPT-basic brought them close to human performance, but they, unlike humans, dropped back to chance when tested on VPT-Strategy. Our challenge demonstrates that the training routines and architectures of today's DNNs are well-suited for learning basic 3D properties of scenes and objects but are ill-suited for reasoning about these properties as humans do. We release our 3D-PC datasets and code to help bridge this gap in 3D perception between humans and machines.
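
For readers unfamiliar with linear probing, the idea is to freeze a pretrained network, extract its features, and fit only a linear classifier on top of them for the binary task. The sketch below is a minimal illustration under loud assumptions: the "features" are synthetic Gaussian vectors standing in for frozen DNN embeddings, and the function names, learning rate, and epoch count are hypothetical choices, not the authors' pipeline.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic-regression weights on frozen features with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Only the probe parameters are updated; the features never change.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose thresholded linear score matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == bool(y))
    return correct / len(labels)


def make_synthetic_features(n, rng):
    """Hypothetical stand-in for frozen embeddings: the label shifts dimension 0."""
    feats, labs = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        mean = 2.0 if y == 1 else -2.0
        feats.append([rng.gauss(mean, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        labs.append(y)
    return feats, labs


rng = random.Random(0)
train_x, train_y = make_synthetic_features(200, rng)
test_x, test_y = make_synthetic_features(200, rng)
w, b = train_linear_probe(train_x, train_y)
print("held-out probe accuracy:", probe_accuracy(w, b, test_x, test_y))
```

Because only a linear layer is trained, probe accuracy reflects what is already linearly decodable from the frozen representation, which is why it is contrasted with fine-tuning in the abstract.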

Paper Structure

This paper contains 32 sections, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Visual Perspective Taking (VPT) is the ability to analyze scenes from different viewpoints. (A) Humans rely on VPT to anticipate the behavior of others. We expect that this ability will be essential for creating the next generation of AI assistants that can accurately anticipate human behavior (images are CC BY-NC). (B) VPT has been studied in developmental psychology since the mid-20th century using cartoon or highly synthetic stimuli. For example, Piaget's "Three Mountains Task" asks observers to describe the scene from the perspective of a bear (image from [Bruce2017-km]). (C) Here, we use Gaussian Splatting [Kerbl2023-ll] to develop a 3D scene generation pipeline for the 3D perception challenge (3D-PC), to systematically compare 3D perception capabilities of human and machine vision systems. (D) The 3D-PC tests (1) object depth perception and (2) VPT.
  • Figure 2: 3D-PC examples. We tested 3D perception in images generated by Gaussian Splatting. Each image depicts a green camera and a red ball. These objects are placed in the scene in a way that counterbalances labels for the depth order and VPT-basic tasks.
  • Figure 3: Human accuracy for object depth order and VPT-basic tasks. Bars near 50% are label-permuted noise floors; lines are group means. The difference is significant (***: $p < 0.001$).
  • Figure 4: DNN performance on the depth order and VPT-basic tasks in the 3D-PC after linear probing or prompting. (A, B) DNNs are significantly more accurate at depth order than VPT-basic. Human confidence intervals are S.E.M. and ***: $p < 0.001$. (C, D) DNN accuracy for depth order and VPT-basic strongly correlates with object classification accuracy on ImageNet. Dashed lines are the mean of label-permuted human noise floors.
  • Figure 5: DNN performance on the depth order and VPT-basic tasks in the 3D-PC after fine-tuning. (A) Fine-tuning makes DNNs far better than humans at the depth order task and improves the performance of several DNNs to be at or beyond human accuracy on VPT-basic. (B) Even after fine-tuning, there is still a significant difference in model performance on depth order and VPT-basic tasks, $p < 0.001$. (C, D) DNN accuracy on both tasks after fine-tuning correlates with ImageNet object classification accuracy. Human confidence intervals are S.E.M. and ***: $p < 0.001$. Dashed lines are the mean of label-permuted human noise floors.
  • ...and 6 more figures
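
The captions for Figures 3 to 5 refer to "label-permuted noise floors". One common way to estimate such a floor is to shuffle the ground-truth labels many times and average the accuracy of the unchanged responses against each shuffle. The sketch below illustrates that idea only; the paper's exact procedure is not given in this summary, and the function names, permutation count, and toy data are assumptions.

```python
import random


def accuracy(responses, labels):
    """Fraction of responses that match the labels."""
    matches = sum(int(r == t) for r, t in zip(responses, labels))
    return matches / len(labels)


def permuted_noise_floor(responses, labels, n_perm=1000, seed=0):
    """Mean accuracy of the responses against randomly shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(labels)  # copy so the caller's labels stay intact
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += accuracy(responses, shuffled)
    return total / n_perm


# Toy data (assumed): 100 balanced binary trials answered perfectly.
labels = [0, 1] * 50
responses = list(labels)
print("observed accuracy:", accuracy(responses, labels))
print("permuted noise floor:", permuted_noise_floor(responses, labels))
```

With balanced binary labels the floor sits close to 50%, which matches the bars described in Figure 3; accuracy well above that floor indicates responses that carry real information about the labels.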