Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior

Jiaying Lin, Shuquan Ye, Dan Xu, Wanli Ouyang, Rynson W. H. Lau

TL;DR

HVSBench is a large-scale, human-centric benchmark for evaluating how well Multimodal LLMs (MLLMs) align with human perceptual behavior across five fields: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. The authors establish an automatic standardization and evaluation protocol and compare 26 MLLMs against human participants, revealing a sizable gap between model and human perceptual behavior. Larger and newer models improve but still struggle with scanpath prediction and field-specific tasks, which points to the need for human-aligned design. The work also discusses practical benefits, including improved QA, captioning, and content generation, and provides datasets and methodologies to advance human-aligned AI systems.

Abstract

While Multimodal Large Language Models (MLLMs) excel at many vision tasks, it is unknown if they exhibit human-like perceptual behaviors. To evaluate this, we introduce HVSBench, the first large-scale benchmark with over 85,000 samples designed to test MLLM alignment with the human visual system (HVS). The benchmark covers 13 categories across 5 key fields: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Our comprehensive evaluation reveals a significant perceptual gap: even state-of-the-art MLLMs achieve only moderate results. In contrast, human participants demonstrate strong performance, significantly outperforming all models. This underscores the high quality of HVSBench and the need for more human-aligned AI. We believe our benchmark will be a critical tool for developing the next generation of explainable MLLMs.

Paper Structure

This paper contains 29 sections, 14 figures, 11 tables.

Figures (14)

  • Figure 1: We are the first to systematically study and assess MLLM-HVS alignment. (a) We propose the large-scale and comprehensive HVSBench, with a robust evaluation protocol. (b) Our comparisons between humans and the state-of-the-art model Qwen3-VL on HVSBench across 5 fields reveal room for improvement and insights for developing HVS-aligned MLLMs.
  • Figure 2: Simplified samples of the 13 question types in HVSBench. GT ranks and scanpath plots are shown for visual clarity.
  • Figure 3: Illustration of our automatic standardization, which robustly formats predictions without introducing errors. In contrast, LLM-based matching (e.g., GPT-4) is both costly and prone to errors, such as failing to extract the correct format or predicting unrelated outputs.
  • Figure 4: Qualitative results. The bounding boxes, the scanpaths and the GT masks from the source datasets (e.g., rank & instances) are for visual clarity and not used in the input images for evaluation. Text is partially omitted due to limited space.
  • Figure 5: MLLMs' predictions differ from GT to a reasonable degree. Both MLLMs and GT differ clearly from random guessing.
  • ...and 9 more figures