VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

Neil Tripathi

TL;DR

Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why, and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes.

Abstract

We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit. All headline numbers are computed on the strict XOR subset, which keeps three of the four cells per family, for 300 scored items per model. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning (ToMAcc). We evaluate nine models spanning flagship and prior-generation closed-source systems and open-source models from 8B to 12B parameters. GPT-4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open-source model, Gemma 3 12B (0.505), surpasses one prior-generation closed-source system. Text-flip robustness exceeds image-flip robustness for six of nine models, and confidence calibration varies substantially: GPT-4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.

Paper Structure (57 sections, 11 equations, 1 figure, 3 tables)

Introduction
Setup at a glance.
Contributions.
Scope.
Data release.
Qualitative examples
Note on ABSTAIN examples.
Related work
Unanswerable visual questions and withholding judgment.
Visibility factors: gaze, occlusion, and field of view.
Hallucination and faithfulness in vision-language models.
Selective prediction, reject options, and risk-coverage.
Second-order perspective and theory-of-mind style probes.
Benchmark design
Task definition and labels
...and 42 more sections

Figures (1)

Figure 1: Representative base/flip examples (one family per primary category). Each pair shows the BASE cell $(I^0,q^0)$ and IMAGE_FLIP cell $(I^1,q^0)$. Most pairs follow the strict XOR pattern: BASE is VF and IMAGE_FLIP is VT. The IC pair illustrates ABSTAIN. (VT=VISIBLY_TRUE, VF=VISIBLY_FALSE.)
