Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

TL;DR

This work introduces Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. The accompanying analysis explores alternative masking strategies for CRG, quantifies CRG's probability shift, and evaluates the role of region guidance strength, empirically validating CRG's design choices.

Abstract

Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model's prior). CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp, as well as to compositional generalization -- improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe -- and to image-text alignment for generated images, where we improve by up to 8.4 AUROC and 6.8 F1 points on SeeTRUE. When reference regions are absent, CRG allows us to re-rank proposed regions in referring expression comprehension and phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an average gain of 3.2% in accuracy. Our analysis explores alternative masking strategies for CRG, quantifies CRG's probability shift, and evaluates the role of region guidance strength, empirically validating CRG's design choices.
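
The contrast described in the abstract can be illustrated with a minimal sketch. It assumes a guidance rule in the style of classifier-free guidance, where logits computed on the original image are pushed away from logits computed on the same image with the region blacked out. The exact formula, the function names `softmax` and `crg_distribution`, the parameter `alpha`, and the toy logit values are assumptions for illustration and are not taken from the text above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def crg_distribution(logits_full, logits_masked, alpha=1.0):
    """Contrast answer logits from the full image against logits from the
    image with the region of interest blacked out.

    Assumed rule (not confirmed by the text above):
        guided = (1 + alpha) * logits_full - alpha * logits_masked
    With alpha = 0 this reduces to the base model's distribution.
    """
    if len(logits_full) != len(logits_masked):
        raise ValueError("both logit lists must cover the same candidates")
    guided = [
        (1 + alpha) * f - alpha * m
        for f, m in zip(logits_full, logits_masked)
    ]
    return softmax(guided)


# Toy example with two candidate answers: ["under", "on"].
# The masked image still favors "under", which reveals the model's prior.
logits_full = [2.0, 1.8]    # hypothetical logits given the original image
logits_masked = [2.0, 0.5]  # hypothetical logits with the region blacked out

base = softmax(logits_full)
guided = crg_distribution(logits_full, logits_masked, alpha=1.0)
```

In this toy case the base distribution prefers "under", while the guided distribution prefers "on", because the score for "under" does not depend on the masked region and is therefore attributed to the prior.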

Paper Structure (27 sections, 2 equations, 11 figures, 6 tables)

This paper contains 27 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Comparison of different methods for visual grounding. (a) Predicting the answer with a base VLM fails. (b) Even when bounding boxes are added, open-source VLMs produce the wrong answer. (c) The VLM can be trained to recognize overlays like bounding boxes, but this process involves updating the VLM and is costly. (d) Our method, CRG, offers a way to correct predictions without training. The right image has relevant object regions blacked out. Here, the model's distribution reflects its prior on answering "under" and "left" even without visual evidence. By factoring this distribution out, we reduce the prior, leading to the correct answer.
  • Figure 2: Left: Illustration of our method, Contrastive Region Guidance (CRG), which guides VLMs to focus on specific regions of interest (ROI). Right: Applications of CRG to various VL tasks: (a): When answering a visual question with ROI, CRG guides a VLM to answer about the specific region. (b): Even when no specific regions are provided, we can leverage an object detector to find important objects and guide the VLM to focus on the objects. (c): For image-text alignment, CRG guides the model in generating text related to objects and their relationships found in images, leading to a higher probability of the correct text versus the incorrect text. (d): CRG can also help VLMs to find the region corresponding to a given text from a set of multiple region proposals by finding the mask that provides the largest contrast.
  • Figure 3: Different masking and overlaying strategies on the What'sUp dataset containing two objects. Top: Blacking out with different regions. Bottom: CRG's blackout strategy (e) contrasted with different methods for overlaying visual markers (f-h).
  • Figure 4: (a) An example of correct and incorrect texts from SugarCrepe Swap-Att. The correct text contains correct words $W_{C}$ (i.e., grey dog) that are reflected in the image, whereas the incorrect text has the attribute swapped to form the incorrect words $W_{I}$ (i.e., black dog). (b) The average probability assigned to all correct words $W_{C}$ and all incorrect words $W_{I}$ by LLaVA-1.6-34B and LLaVA-1.6-34B + CRG.
  • Figure 5: Ablations of $\alpha$ on What'sUp (a), SugarCrepe (b), and SeeTRUE (c). We evaluate $\alpha$ between 0 and 1 on the top graphs, and between 1 and 10 on the bottom graphs.
  • ...and 6 more figures
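
The region re-ranking idea in the Figure 2 caption (d), choosing the mask that provides the largest contrast, can be sketched as follows. This is a hedged illustration only: the scoring rule (difference in log-probability of the text with and without each region blacked out), the function name `rerank_regions`, and the numeric values are assumptions, not the paper's stated procedure.

```python
def rerank_regions(logp_full, logp_masked):
    """Rank candidate regions by how much blacking each one out lowers the
    log-probability of the target text.

    logp_full:   log-probability of the text given the original image
    logp_masked: dict mapping region id -> log-probability of the text
                 given the image with that region blacked out
    Returns region ids sorted from largest to smallest contrast.
    """
    contrast = {
        region: logp_full - logp
        for region, logp in logp_masked.items()
    }
    return sorted(contrast, key=contrast.get, reverse=True)


# Hypothetical scores for three proposed boxes. Masking "box_b" hurts the
# text probability the most, so it is treated as the best-matching region.
logp_full = -2.0
logp_masked = {"box_a": -2.1, "box_b": -5.0, "box_c": -2.6}

ranking = rerank_regions(logp_full, logp_masked)
best_region = ranking[0]
```

The design choice here is that a region matters for the text only if removing it changes the model's score, so proposals that leave the score nearly unchanged are ranked last.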