Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior
Jiaying Lin, Shuquan Ye, Dan Xu, Wanli Ouyang, Rynson W. H. Lau
TL;DR
HVSBench is a large-scale, human-centric benchmark for evaluating how well Multimodal LLMs (MLLMs) align with human perceptual behavior across five key visual fields. The authors establish an automatic standardization and evaluation protocol and compare 26 MLLMs against human participants, revealing a sizable gap between model and human performance. Larger and newer models improve but still struggle with scanpath prediction and field-specific tasks, highlighting the need for human-aligned design. The work also discusses practical benefits, including improved QA, captioning, and content generation, and provides datasets and methodologies to advance human-aligned AI systems.
Abstract
While Multimodal Large Language Models (MLLMs) excel at many vision tasks, it is unknown whether they exhibit human-like perceptual behaviors. To evaluate this, we introduce HVSBench, the first large-scale benchmark with over 85,000 samples designed to test MLLM alignment with the human visual system (HVS). The benchmark covers 13 categories across 5 key fields: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Our comprehensive evaluation reveals a significant perceptual gap: even state-of-the-art MLLMs achieve only moderate results. In contrast, human participants demonstrate strong performance, significantly outperforming all models. This underscores the high quality of HVSBench and the need for more human-aligned AI. We believe our benchmark will be a critical tool for developing the next generation of explainable MLLMs.
