Binary Verification for Zero-Shot Vision

Jeffrey Liu, Rongbin Hu

TL;DR

This work tackles the unreliability of open-ended zero-shot vision queries by proposing a training-free binary verification workflow that first quantizes a query into a small MCQ shortlist and then binarizes per-candidate checks with deterministic resolution. Using off-the-shelf VLMs, it demonstrates significant gains across five tasks spanning Referring Expression Grounding and Spatial-Reasoning benchmarks, without task-specific fine-tuning. The authors provide a compact theory showing a hardness ladder from $K$-way to MCQ to binary under $0$–$1$ loss and an information-theoretic justification that calibrated posteriors can recover Bayes decisions, along with a capability model explaining when simpler prompts yield stronger signals. Collectively, this yields a simple, unified, training-free approach that improves zero-shot vision performance and offers a practical path for deployment with today’s vision-language models.

Abstract

We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today's VLMs.
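The deterministic resolution rule in step (ii) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `resolve_binary_verification` and the two callables `verify` (one True/False query per candidate) and `choose` (an MCQ query over a list of candidates) are hypothetical stand-ins for calls to a VLM. The abstract does not spell out what "remaining plausible candidates" means when no candidate is verified; the sketch assumes the fallback MCQ then runs over the full shortlist.

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")


def resolve_binary_verification(
    candidates: Sequence[C],
    verify: Callable[[C], bool],
    choose: Callable[[List[C]], C],
) -> C:
    """Pick one candidate from a quantized shortlist.

    verify(c)    -- hypothetical True/False check for a single candidate.
    choose(pool) -- hypothetical MCQ query that returns one item of pool.
    """
    if not candidates:
        raise ValueError("quantization must yield at least one candidate")

    # Binarization: one independent True/False question per candidate.
    accepted = [c for c in candidates if verify(c)]

    # Deterministic resolution: exactly one True answer wins outright.
    if len(accepted) == 1:
        return accepted[0]

    # Otherwise fall back to an MCQ. Assumption: if several candidates
    # were accepted, only those stay in play; if none were accepted,
    # the full shortlist is used again.
    pool = accepted if accepted else list(candidates)
    return choose(pool)
```

In practice `verify` and `choose` would wrap prompts to the same off-the-shelf VLM, so the fallback adds at most one extra query on top of the per-candidate checks.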

Paper Structure

This paper contains 14 sections, 12 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An example of the binary verification workflow for REC.
  • Figure 2: Illustrations for the Spatial-Reasoning family: for (a), we restrict to pairwise relations; for (b), we fix the query cell to (3,3), which is empirically harder; for (c), we directly request the full path (start, turns, end).
  • Figure 3: Explicit spatial quantization via visible grid overlays.
  • Figure 4: An example of BLINK-Jigsaw.