Zero-Shot Referring Expression Comprehension via Vision-Language True/False Verification

Jeffrey Liu, Rongbin Hu

TL;DR

The paper addresses referring expression comprehension (REC) without task-specific training by proposing a verification-first, zero-shot workflow. It reframes REC as per-box visual-language verification: a class-conditioned detector provides region proposals from the image, and a general vision-language model answers a binary True/False query for each region, with abstention and tie-break logic when needed. The method achieves strong zero-shot performance on RefCOCO, RefCOCO+, and RefCOCOg, outperforming the zero-shot GroundingDINO baseline and many REC-trained approaches, which indicates that workflow design can drive major gains without task-specific pretraining. The work highlights that decoupling candidate proposal from verification reduces interference and enables robust abstention, and it suggests that such modular designs can generalize to other complex vision-language tasks.

Abstract

Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
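
The box-wise verification loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Proposal`, `Verifier`, and `verify_first_rec` are hypothetical names, the verifier stands in for a VLM call that returns True, False, or None (no clear answer), and breaking ties by detector confidence is an assumption, since the summary does not say how ties are resolved.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Proposal:
    """One detector proposal (hypothetical container for this sketch)."""
    box: Box
    score: float  # detector confidence


# Hypothetical verifier: wraps a VLM that is asked, for one box at a time,
# "does this region match the expression?" and returns True, False, or
# None when it gives no usable answer.
Verifier = Callable[[Box, str], Optional[bool]]


def verify_first_rec(
    proposals: Sequence[Proposal],
    expression: str,
    verify: Verifier,
) -> Optional[Proposal]:
    """Return the chosen proposal, or None to abstain."""
    # Each box is verified independently, so one box's answer cannot
    # influence another's (the "reduced cross-box interference" idea).
    accepted = [p for p in proposals if verify(p.box, expression) is True]

    if not accepted:
        return None  # abstain: no box was verified as a match

    # Assumption: when several boxes are accepted, keep the one the
    # detector scored highest. The source does not specify its tie-break.
    return max(accepted, key=lambda p: p.score)
```

A caller would supply proposals from the detector and a verifier that wraps the VLM prompt. Because the verifier is a plain callable, a closed or open VLM can be swapped in without changing the loop.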

Paper Structure

This paper contains 7 sections, 3 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: The concept behind our verification-first workflow, which enables outstanding accuracy in a training-free zero-shot setting.
  • Figure 2: Threshold and gain analyses derived from the two-candidate model (see the analysis section).