Visually Prompted Benchmarks Are Surprisingly Fragile

Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, XuDong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, Angjoo Kanazawa

TL;DR

The paper interrogates whether vision-language models genuinely ground visual content when evaluated with visually prompted tasks, revealing that tiny prompt-design and implementation details can drastically alter outcomes. It shows that sampling variance and low-level inference choices induce substantial leaderboard fragility, often outweighing true perceptual differences. By expanding to VPBench with 35,088 annotated images and 16 visual marker variants, the authors demonstrate reduced variance and more reliable cross-model comparisons. They advocate standardized, uncertainty-aware evaluation practices and provide VPBench and tooling to enable robust, reproducible benchmarking of visual grounding in VLMs.

Abstract

A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at https://lisadunlap.github.io/vpbench/.
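
One way to make "changing the marker changes the ranking" concrete is to count how many model pairs swap order between two marker variants. The following is a minimal sketch, not the authors' analysis code: the function names `ranking` and `pairwise_flips` are hypothetical, and the accuracies for `model_a`, `model_b`, and `model_c` are placeholder values invented for illustration, not results from the paper.

```python
from itertools import combinations


def ranking(scores):
    """Return model names ordered from highest to lowest accuracy.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(scores, key=lambda name: (-scores[name], name))


def pairwise_flips(scores_a, scores_b):
    """Count model pairs whose relative order differs between two settings.

    A pair counts as flipped only if one model is strictly ahead in the
    first setting and strictly behind in the second.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both settings must score the same models")
    flips = 0
    for m1, m2 in combinations(sorted(scores_a), 2):
        diff_a = scores_a[m1] - scores_a[m2]
        diff_b = scores_b[m1] - scores_b[m2]
        if diff_a * diff_b < 0:
            flips += 1
    return flips


# PLACEHOLDER accuracies for illustration only; not taken from the paper.
red_marker = {"model_a": 0.71, "model_b": 0.69, "model_c": 0.60}
blue_marker = {"model_a": 0.66, "model_b": 0.70, "model_c": 0.61}

print("red ranking: ", ranking(red_marker))
print("blue ranking:", ranking(blue_marker))
print("flipped pairs:", pairwise_flips(red_marker, blue_marker))
```

A nonzero flip count between two variants that differ only in marker appearance is the kind of instability the abstract describes. Whether a given flip is meaningful still depends on the uncertainty of each accuracy estimate.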

Paper Structure

This paper contains 27 sections, 3 equations, 19 figures.

Figures (19)

  • Figure 1: Small, seemingly irrelevant changes in visual prompting dramatically alter VLM predictions. Left: Qwen2.5-VL accuracy under different visual marker variants. Changes in marker size, shape, color, and label position lead to significant accuracy swings of up to 13%. Right: such variations can reorder leaderboards, with model rankings shifting even when nothing about the underlying task changes.
  • Figure 2: Examples of visually prompted tasks. Visually prompted tasks (VPTs) involve placing visual markers in the image to ask questions such as relative depth and semantic correspondence.
  • Figure 3: Larger benchmark datasets stabilize rankings. Accuracies and rankings of 9 VLMs on BLINK Relative Depth, BLINK Semantic Correspondence, VPBench Relative Depth (VPBench-RD), and VPBench Semantic Correspondence (VPBench-SC) using BLINK's default marker convention. Error bars show 95% confidence intervals. Compared to BLINK’s small test splits, the larger VPBench relative depth and semantic correspondence evaluations yield substantially narrower intervals, making ranking differences easier to interpret and less sensitive to sampling noise.
  • Figure 4: VPBench-RD Model Ranking per data split
  • Figure 5: VPBench-SC Model Ranking per data split
  • ...and 14 more figures
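
The Figure 3 caption notes that larger evaluation sets yield narrower 95% confidence intervals. The sketch below illustrates why, under stated assumptions: it uses a Wilson score interval for a binomial accuracy, which is a swapped-in choice because the interval method used in the paper is not given here. The function names and the sample sizes (100 and 10,000) are illustrative and do not correspond to the actual BLINK or VPBench split sizes.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def interval_width(accuracy, total):
    """Width of the Wilson interval for a given accuracy and sample size."""
    correct = round(accuracy * total)
    low, high = wilson_interval(correct, total)
    return high - low


# Illustrative sample sizes only; not the real benchmark split sizes.
for n in (100, 10_000):
    low, high = wilson_interval(round(0.7 * n), n)
    print(f"n={n}: interval=({low:.3f}, {high:.3f}), width={high - low:.3f}")
```

Because the interval width shrinks roughly with the square root of the sample size, two models separated by a few accuracy points can have overlapping intervals on a small split and clearly separated intervals on a large one.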