Visuospatial Perspective Taking in Multimodal Language Models

Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, Lucy Cheke

Abstract

As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.

Paper Structure (22 sections, 11 figures, 21 tables)

Figures (11)

  • Figure 1: Example stimuli from the Rotating Figure Task (visual field cone shown in yellow for illustration). Left: the figure can see the symbol in front of them (Test 1). Centre: the symbol p appears as a d from the figure's perspective and is located on their right (Test 2). Right: if asked how the symbol on their right appears to them, the correct response would be 6, which requires both visual and spatial perspective taking (Test 3). A line-of-sight arrow was added to scaffold the models' ability to perceive the figure's viewing direction from a top-down perspective.
  • Figure 2: Mean MLM accuracy in Test conditions of the Rotating Figure Task, binned by figure rotation angle (12 bins) and aggregated across visual and spatial questions in Tests 1–2. 0° indicates a shared perspective and 180° an opposite perspective; 90° and -90° correspond to right- and left-facing figures, respectively. The dashed line indicates chance.
  • Figure 3: Example stimulus from the Director Task. If the director says "Please select the rightmost blue, non-striped item of clothing from my point of view", he would be referring to the item in C3. The answer is not D1 or D4, as the director specified the rightmost item from his perspective. The answer is also not A1 as the item is occluded from the director's view. All other items do not fit the object specification.
  • Figure 4: Mean MLM accuracy in the Director Task on visual (occluded vs. no occluded alternatives) and spatial (horizontal vs. vertical adjectives) VPT trials. The data are plotted separately for Image and ASCII tasks, and include only spatial relative adjective trials from the director's perspective. Error bars are 95% confidence intervals.
  • Figure 5: Example grid from the ASCII version of the Director Task, shown as rendered from the underlying markdown representation. Item descriptors and attributes (e.g., size) are displayed explicitly, and occluded items are marked with [BLOCKED].
  • ...and 6 more figures
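
The reference resolution described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Item` class, the function names, and the example grid are hypothetical, and the sketch assumes that cell labels combine a column letter with a row number and that the director faces the viewer, so his right corresponds to the viewer's left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical representation of one grid cell.
    cell: str       # e.g. "C3"; assumed to be column letter + row number
    colour: str
    striped: bool
    occluded: bool  # True if the item is hidden from the director


def matching(items, colour, striped):
    """Items that fit the verbal description, ignoring who can see them."""
    return [it for it in items if it.colour == colour and it.striped == striped]


def directors_rightmost(items, colour, striped):
    """Cell the director refers to with 'rightmost ... from my point of view'."""
    # Visual perspective taking: drop items the director cannot see.
    visible = [it for it in matching(items, colour, striped) if not it.occluded]
    if not visible:
        return None
    # Spatial perspective taking (assumption: director faces the viewer),
    # so his rightmost item sits in the viewer's leftmost matching column.
    return min(visible, key=lambda it: it.cell[0]).cell


def egocentric_rightmost(items, colour, striped):
    """The error the caption rules out: answering from the viewer's own view."""
    candidates = matching(items, colour, striped)
    if not candidates:
        return None
    return max(candidates, key=lambda it: it.cell[0]).cell


# Hypothetical grid mirroring the cells named in the Figure 3 caption.
grid = [
    Item("A1", "blue", False, occluded=True),
    Item("C3", "blue", False, occluded=False),
    Item("D1", "blue", False, occluded=False),
    Item("D4", "blue", False, occluded=False),
    Item("B2", "red", True, occluded=False),
]

print(directors_rightmost(grid, "blue", striped=False))  # C3
```

Under these assumptions, A1 would be the director's rightmost match but is excluded because it is occluded, while D1 and D4 are rightmost only from the viewer's own side, which leaves C3 as in the caption.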