VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei

TL;DR

VP-Bench is a two-stage benchmark that measures how multimodal LLMs (MLLMs) perceive and use visual prompts (VPs). Stage 1 probes VP perception with 30k prompts spanning eight shapes and 355 attribute combinations, evaluated on 34,267 images and 38,932 QA pairs. Stage 2 assesses the effect of VPs on six downstream VP-enabled tasks, using the best-performing VP configurations. The evaluation covers 28 MLLMs, both open-source and proprietary. Key findings: regular VP shapes are easier to perceive, textual VP descriptions improve alignment, and larger models show stronger VP perception and downstream performance, though domain knowledge remains influential. The benchmark serves as a reference for designing VP-friendly prompts and for guiding improvements in grounded referring and spatial understanding in MLLMs.

Abstract

Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.

Paper Structure

This paper contains 22 sections, 15 figures, 18 tables.

Figures (15)

  • Figure 1: Overview of the VP-Bench Dataset. VP-Bench introduces a two-stage evaluation framework: (1) Model Perception, which assesses general VP recognition capabilities using 30K visualized VPs spanning five question types; and (2) VP Effect on Downstream Tasks, which evaluates the impact of visual prompts on various downstream applications. All questions follow a multiple-choice format, but the full list of options is not displayed due to space limitations.
  • Figure 2: An illustration of our pipeline for data collection. Stage 1 is used to determine the general capabilities of MLLMs in recognizing VPs, while Stage 2 clarifies the impact of using VPs on downstream tasks.
  • Figure 3: Benchmark Stage 1 Q&As generation pipeline.
  • Figure 4: Benchmark Stage 2 Q&As generation pipeline.
  • Figure 5: Qualitative results of arrow shape in Stage 1 evaluation.
  • ...and 10 more figures
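
Because all benchmark questions follow a multiple-choice format (see the Figure 1 caption), scoring can be reduced to per-question-type accuracy. The following is a minimal sketch of such a scorer, not the authors' evaluation code: the record fields (`question_type`, `prediction`, `answer`) and the type names `type_a` and `type_b` are hypothetical placeholders, and predictions are assumed to have already been reduced to a single option letter.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute multiple-choice accuracy for each question type.

    Each record is a dict with hypothetical keys:
      - "question_type": label of the question category
      - "prediction":    option letter chosen by the model
      - "answer":        ground-truth option letter
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        qtype = record["question_type"]
        totals[qtype] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[qtype] += 1
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


# Toy records with placeholder type names (not taken from the benchmark).
records = [
    {"question_type": "type_a", "prediction": "A", "answer": "A"},
    {"question_type": "type_a", "prediction": "c ", "answer": "B"},
    {"question_type": "type_b", "prediction": "d", "answer": "D"},
]
print(accuracy_by_type(records))
```

Reporting accuracy per question type rather than as a single pooled number makes it easier to see which kinds of VP questions a model handles poorly.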