MVP-Bench: Can Large Vision--Language Models Conduct Multi-level Visual Perception Like Humans?

Guanzhen Li; Yuxi Xie; Min-Yen Kan

TL;DR

MVP-Bench is the first visual-language benchmark to systematically evaluate both low- and high-level visual perception of LVLMs. The performance gap between natural and manipulated images indicates that current LVLMs do not generalize in understanding the visual semantics of synthetic images as humans do.

Abstract

Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, replacing the shopping bag held by a person with a gun suggests criminal or violent activity. Despite significant advances on various multimodal tasks, the ability of Large Visual-Language Models (LVLMs) to conduct such multi-level visual perception remains unexplored. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences model perception. Using MVP-Bench, we diagnose the visual perception of 10 open-source and 2 closed-source LVLMs, showing that high-level perception tasks significantly challenge existing LVLMs. Even the state-of-the-art GPT-4o achieves only $56\%$ accuracy on high-level Yes/No questions, compared with $74\%$ in low-level scenarios. Furthermore, the performance gap between natural and manipulated images indicates that current LVLMs do not generalize in understanding the visual semantics of synthetic images as humans do. Our data and code are publicly available at https://github.com/GuanzhenLi/MVP-Bench.
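
The two comparisons the abstract reports, accuracy by perception level and the gap between natural and manipulated images, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record fields (`level`, `image_type`, `answer`, `prediction`), the helper `accuracy_by_group`, and the toy records are hypothetical and do not reflect the actual data format or evaluation code in the MVP-Bench repository.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct Yes/No answers}, grouped by `key`.

    `records` is a list of dicts with hypothetical fields; a prediction
    counts as correct if it matches the answer, ignoring case and whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Toy records, invented for illustration only (not benchmark data).
records = [
    {"level": "high", "image_type": "natural", "answer": "yes", "prediction": "Yes"},
    {"level": "high", "image_type": "manipulated", "answer": "no", "prediction": "yes"},
    {"level": "low", "image_type": "natural", "answer": "yes", "prediction": "yes"},
    {"level": "low", "image_type": "manipulated", "answer": "no", "prediction": "No"},
]

# Accuracy per perception level (high vs. low).
by_level = accuracy_by_group(records, "level")

# Accuracy per image type, and the natural-minus-manipulated gap.
by_image_type = accuracy_by_group(records, "image_type")
gap = by_image_type["natural"] - by_image_type["manipulated"]

print(by_level)
print(by_image_type, gap)
```

Grouping by a single key keeps the sketch small; a real evaluation would likely also separate question types such as Yes/No, cross-image, and multiple-choice questions.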

Paper Structure (31 sections, 1 equation, 11 figures, 6 tables)

Introduction
Related Work
Visual Perception.
Vision--Language Benchmarks.
Synthetic Images.
MVP-Bench Evaluation Suite
Evaluation across Perception Levels
Evaluation with Image Pairs
MVP-Bench Construction
Construction Pipeline
Step one: Idea Generation.
Step two: Manipulated Image Generation.
Step three: Visual Question Generation.
MVP-Bench Statistics
Experiments
...and 16 more sections

Figures (11)

Figure 1: A sample of MVP-Bench manifesting both high- and low-level visual perception. Image 1 and Image 2 form an image pair. Their different backgrounds indicate that the man is engaged in different behaviours.
Figure 2: MVP-Bench three-step construction pipeline (best viewed in color). Step 1 uses three categories ('Behaviour-Background', 'Role-Clothes', 'Emotion-Facial Expression') as examples to illustrate how high-level perception guides the identification of low-level perception. Step 2 demonstrates three categories of manipulated image generation: Overall Background Substitution, Partial Component Substitution, and Direct Alteration (from left to right). Step 3 explains how to generate questions based on the ideas obtained in Step 1, with the same colour indicating that the generated question is based on the corresponding part from the expected perception.
Figure 3: MVP-Bench statistics. (a) shows 5 high-level (${L_h}$) categories and 13 low-level (${L_l}$) categories, where the mapping relationship indicates that the low-level features can support certain high-level perceptions. (b) shows the distribution of questions. Y/N, CI, MCQ denote Yes/No questions, cross-image questions, and single-image multiple-choice questions respectively. (c) demonstrates the distribution of images with questions at different levels. (d) and (e) demonstrate that our pipeline successfully generates pairs of images with significantly distinct content.
Figure 4: Case study. We highlight the incorrect and correct parts of the answer.
Figure 5: Cases for 'Behaviour-Background' and 'Behaviour-Movement' categories.
...and 6 more figures
