Binary Verification for Zero-Shot Vision
Jeffrey Liu, Rongbin Hu
TL;DR
This work tackles the unreliability of open-ended zero-shot vision queries with a training-free binary verification workflow: it first quantizes a query into a small multiple-choice question (MCQ) shortlist, then binarizes it into one True/False check per candidate and resolves the answers deterministically. Using off-the-shelf VLMs, it demonstrates significant gains on five tasks spanning referring expression grounding (REC), spatial reasoning, and BLINK-Jigsaw, without task-specific fine-tuning. The authors provide a compact theory showing a hardness ladder from $K$-way to MCQ to binary under $0$–$1$ loss, an information-theoretic justification that calibrated posteriors can recover Bayes decisions, and a capability model explaining when simpler prompts yield stronger signals. Together, these yield a simple, unified approach that improves zero-shot vision performance and offers a practical path for deployment with today’s vision-language models.
Abstract
We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today's VLMs.
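The deterministic resolution rule in step (ii) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `binary_verify` and the callables `ask_true_false` and `ask_mcq` are hypothetical stand-ins for VLM prompts. The abstract does not spell out what "remaining plausible candidates" means, so the sketch assumes the fallback MCQ runs over the candidates judged True, or over the full shortlist when none is judged True.

```python
from typing import Callable, Sequence


def binary_verify(
    candidates: Sequence[str],
    ask_true_false: Callable[[str], bool],   # hypothetical: one True/False VLM query
    ask_mcq: Callable[[Sequence[str]], str],  # hypothetical: one MCQ VLM query
) -> str:
    """Pick one candidate from a quantized shortlist via True/False checks."""
    if not candidates:
        raise ValueError("need at least one candidate")

    # Binarization: ask one True/False question per candidate.
    accepted = [c for c in candidates if ask_true_false(c)]

    # Deterministic resolution: exactly one True means we select it.
    if len(accepted) == 1:
        return accepted[0]

    # Assumption (not stated in the source): fall back to an MCQ over the
    # candidates judged True, or over the whole shortlist if none was.
    fallback = accepted if accepted else list(candidates)
    return ask_mcq(fallback)
```

In practice the two callables would wrap prompts to the same off-the-shelf VLM, so the workflow costs at most one True/False query per candidate plus one MCQ query.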
