MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yolo Y. Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu

TL;DR

MMPerspective is the first dedicated benchmark for assessing perspective understanding in multimodal LLMs, spanning 10 tasks across three dimensions: Perspective Perception, Reasoning, and Robustness. The dataset comprises 2,711 images and 5,083 QA pairs. An evaluation of 43 MLLMs reveals strong size-related gains in basic perception but weak robustness and weak compositional spatial reasoning. Key findings are that overall model scale contributes more to perspective capability than the vision encoder does, that chain-of-thought prompting provides reliable gains, and that architectural biases shape failure modes. The benchmark serves as a diagnostic tool and a roadmap for building geometry-aware vision-language systems with improved spatial priors and reasoning.

Abstract

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and with maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns relating model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at: https://yunlong10.github.io/MMPerspective/
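
To make the notion of a vanishing point concrete, the following is a minimal sketch of the underlying geometry, not code from the benchmark. It relies on the standard fact that lines parallel in 3D project to image lines meeting at a common point, and finds that point with homogeneous coordinates. The function names (`cross`, `line_through`, `vanishing_point`) and the tolerance `eps` are illustrative assumptions.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Vec3 = Tuple[float, float, float]
Segment = Tuple[Point, Point]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p: Point, q: Point) -> Vec3:
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(seg_a: Segment, seg_b: Segment,
                    eps: float = 1e-9) -> Optional[Point]:
    """Intersect the image lines through two segments.

    Returns None when the image lines are parallel, i.e. the
    intersection lies at infinity (hypothetical tolerance eps).
    """
    x, y, w = cross(line_through(*seg_a), line_through(*seg_b))
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Two "rails" drawn in one-point perspective converge at (5, 5).
left_rail = ((0.0, 0.0), (4.0, 4.0))
right_rail = ((10.0, 0.0), (6.0, 4.0))
vp = vanishing_point(left_rail, right_rail)
```

In this toy setting, counting vanishing points amounts to counting distinct convergence points among groups of such lines; how the benchmark poses these questions to a model is described in the paper itself.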

Paper Structure

This paper contains 31 sections, 2 equations, 39 figures, 4 tables.

Figures (39)

  • Figure 1: MMPerspective benchmark overview. We introduce 10 tasks spanning 3 complementary dimensions of perspective understanding: Perspective Perception, Reasoning, and Robustness.
  • Figure 2: Left: MMPerspective benchmark consists of 2,711 instances and 5,083 QA pairs, hierarchically organized into 3 core categories (Perspective Perception, Reasoning, and Robustness). Right: The accuracy of 8 representative MLLMs on 10 tasks of MMPerspective across the 3 categories.
  • Figure 3: Perspective illustration with terminology. The figure is adapted from Robertson (2013).
  • Figure 4: Data Curation Pipeline for MMPerspective.
  • Figure 5: Heatmaps illustrating the relationship between model size and performance, measured by P&R Overall Accuracy and Robustness. Darker colors indicate higher performance. Each line represents a model family, with sizes increasing from left to right.
  • ...and 34 more figures