Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Aditya Kanade, Tanuja Ganu

TL;DR

The paper introduces Do You See Me, a scalable, programmatically generated benchmark that evaluates visual perception in multimodal LLMs across 2D and 3D tasks inspired by human psychology. It pairs a joint perception-reasoning dataset with a broad evaluation of eleven leading MLLMs. Humans far outperform the models (≈96% vs. below 50%), and correct final answers can mask underlying perceptual failures. The study also analyzes failure modes, finds limited gains from supervised finetuning, and shows that the task format (MCQ) and Chain-of-Thought prompting can each help or hinder performance depending on how verbalizable the task is. The work highlights fundamental perceptual bottlenecks in current MLLMs and provides open-source data and code to support future work on robust visual grounding and perception-aware reasoning.

Abstract

Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.

Paper Structure

This paper contains 34 sections, 25 figures, 11 tables.

Figures (25)

  • Figure 1: Visual Misinterpretations in Popular Multimodal LLMs
  • Figure 2: Do You See Me benchmark visual perception dimensions
  • Figure 3: Comparison of MLLM visual reasoning performance (a) and error breakdowns (b, c) for correct and incorrect final answers respectively (Claude Sonnet-3.5).
  • Figure 4: Comparison of MLLM and human performance across controlled difficulty levels: (a) on Letter Disambiguation task, and (b) on Visual Form Constancy.
  • Figure 5: Human Performance Benchmarking
  • ...and 20 more figures