
Egocentric Bias in Vision-Language Models

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, Hokin Deng, Dezhi Luo

TL;DR

The paper tackles Level-2 Visual Perspective Taking ($L2$ VPT) in Vision-Language Models (VLMs) by introducing FlipSet, a benchmark that isolates the spatial transformation component using 2D strings rotated by $180^\circ$ and a controlled four-way response scheme that separates correct, egocentric, confusable, and random answers. It evaluates 103 publicly available VLMs under zero-shot conditions and runs control tests to disentangle Theory of Mind (ToM), Mental Rotation (MR), and their integration in $L2$ VPT. The results reveal a robust egocentric bias: most models report the camera viewpoint rather than simulating the perspective of the depicted agent (a monkey), with average performance far below chance; chain-of-thought reasoning does not alleviate this. Control experiments show that ToM is strong, MR is modest, and $L2$ VPT is markedly deficient, indicating a compositional deficit in which models fail to integrate perspective awareness with spatial transformations in situated reasoning. FlipSet thus provides a cognitively grounded diagnostic of perspective-taking capabilities in multimodal systems and highlights the need for architectural innovations that enable model-based spatial reasoning and the binding of social awareness to spatial operations.
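
The four-way response scheme can be illustrated with a minimal scoring sketch. This is not the authors' evaluation code: the function names, the option labels, and the choice to count every non-correct response (including the Fail category listed in Figure 1) as an error are assumptions made for illustration only.

```python
from collections import Counter

# Response categories named in the paper; "fail" covers invalid or empty output.
CATEGORIES = ("correct", "egocentric", "confusable", "random", "fail")

# With four answer options, uniform guessing yields 25% accuracy.
CHANCE_ACCURACY = 1 / 4


def classify(choice, option_roles):
    """Map a chosen option label to a response category.

    `option_roles` (hypothetical format) maps each option label, e.g. "A",
    to its role. Any choice not in the mapping counts as "fail".
    """
    return option_roles.get(choice, "fail")


def summarize(categories):
    """Return per-category rates and the egocentric share of all errors.

    Assumption: every non-correct response, including "fail", is an error.
    """
    total = len(categories)
    if total == 0:
        raise ValueError("no responses to summarize")
    counts = Counter(categories)
    rates = {name: counts[name] / total for name in CATEGORIES}
    errors = total - counts["correct"]
    egocentric_share = counts["egocentric"] / errors if errors else 0.0
    return rates, egocentric_share
```

Under these assumptions, a model shows the bias described above when `rates["correct"]` falls below `CHANCE_ACCURACY` while `egocentric_share` is well above the one-third share that evenly spread wrong picks among the three distractors would give.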

Abstract

Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
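
To make the transformation concrete, the sketch below computes how a short string reads after a 180-degree turn. It assumes the rotation is in-plane (the string viewed upside down), which this summary does not spell out, and it uses a small hypothetical glyph table rather than FlipSet's actual character set.

```python
# Hypothetical glyph table: each key resembles its value after a 180-degree
# in-plane turn. Illustrative only; not taken from FlipSet.
ROTATED_GLYPH = {
    "0": "0", "6": "9", "8": "8", "9": "6",
    "b": "q", "q": "b", "d": "p", "p": "d",
    "n": "u", "u": "n", "o": "o", "s": "s", "x": "x", "z": "z",
}


def rotate_180(text):
    """Return how `text` reads after a 180-degree in-plane rotation.

    Turning a string upside down reverses the character order and
    replaces each character with its rotated counterpart.
    """
    rotated = []
    for ch in reversed(text):
        if ch not in ROTATED_GLYPH:
            raise ValueError(f"no rotated form for character {ch!r}")
        rotated.append(ROTATED_GLYPH[ch])
    return "".join(rotated)


camera_view = "608"
agent_view = rotate_180(camera_view)  # "809"
```

In this toy setting, answering `camera_view` ("608") instead of `agent_view` ("809") would be the egocentric error. Applying `rotate_180` twice returns the original string, since the glyph table is its own inverse.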

Paper Structure (19 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: FlipSet benchmark design and evaluation approach. (a) Prompt type and error types in model responses across cognitive tasks. Each FlipSet item asks the model what a monkey sees on the back of a card, which requires a 180° mental rotation from the monkey's viewpoint. Answer options correspond to distinct reasoning outcomes: Correct (successful perspective taking), Egocentric (front-view repetition), Confusable (visually similar distractor), Random (unrelated guess), and Fail (invalid or empty output). (b) Comparison of three cognitive tasks. L1 VPT requires a simple visibility judgment, MR involves pure geometric transformation, and L2 VPT demands complex perspective simulation. All tasks use identical visual stimuli.
  • Figure 2: Error type distribution across 103 models on L2 visual perspective taking tasks. Categories include Correct, Egocentric, Confusable, Random, and Fail responses. Complete results by model family are presented in the model-families table (tab:model_families).
  • Figure 3: Control Experiment Results: Model Performance and Task Correlations. (a) Performance comparison of 24 models on three cognitive tasks: ToM (theory of mind), L2 (Level 2 perspective taking), and MR (mental rotation). Models are grouped by family and sorted by L2 accuracy within each family. (b) Correlation analysis between the three cognitive tasks, showing the relationships between ToM, L2 VPT, and MR performance across the 24 models.
  • Figure A1: Error curves across 12 answer-position permutations. Egocentric error rates fluctuate by $\approx 9$ pp, while confusable error rates fluctuate by $\approx 12$ pp, with particularly high confusable rates in layouts 7–9; overall accuracy shifts remain within $\approx 11$ pp. This suggests that answer position has limited influence on model behavior: errors persist due to underlying biases rather than positional effects.