Probing Visual Language Priors in VLMs

Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee

TL;DR

The paper introduces ViLP, a benchmark designed to probe whether Vision-Language Models depend on visual language priors rather than true visual reasoning by using out-of-distribution images and distractor-informed questions. It demonstrates a substantial gap between human performance and current VLMs on ViLP, motivating a self-improvement approach called Image-DPO that generates and corrupts VQA data to emphasize visual inputs. The Image-DPO objective is shown to upper-bound the RLHF objective, and experiments across open-source models like LLaVA-v1.5 and Cambrian show consistent performance gains. The work provides a scalable data-generation pipeline, theoretical links to learning from feedback, and releases to support future research in improving visual reasoning in VLMs.

Abstract

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors present in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
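The "good-bad" image pairs mentioned in the abstract can be illustrated with a small sketch. It is not the authors' pipeline: the additive Gaussian noise, the nested-list grayscale image, and the names `corrupt_pixels` and `make_pair` are all assumptions chosen to keep the example self-contained, and the semantic corruptions from the abstract are not covered.

```python
import random


def corrupt_pixels(image, noise_std=40.0, seed=0):
    """Return a noisy copy of a grayscale image (rows of 0-255 ints).

    Additive Gaussian noise is an assumed stand-in for a pixel-level
    corruption; the paper's actual corruption operators may differ.
    """
    rng = random.Random(seed)
    corrupted = []
    for row in image:
        new_row = []
        for pixel in row:
            noisy = pixel + rng.gauss(0.0, noise_std)
            # Clamp back into the valid 0-255 intensity range.
            new_row.append(int(min(255, max(0, round(noisy)))))
        corrupted.append(new_row)
    return corrupted


def make_pair(question, answer, image, seed=0):
    """Bundle one VQA sample with a good image and a corrupted image."""
    return {
        "question": question,
        "answer": answer,
        "image_good": image,
        "image_bad": corrupt_pixels(image, seed=seed),
    }


if __name__ == "__main__":
    toy_image = [[128] * 4 for _ in range(4)]  # hypothetical 4x4 gray image
    pair = make_pair("What animal is shown?", "fox", toy_image)
    print(pair["image_bad"])
```

The question and answer are shared across both images, so any preference signal built on such a pair has to come from the visual input rather than from the text.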

Paper Structure

This paper contains 30 sections, 1 theorem, 19 equations, 35 figures, 5 tables.

Key Result

Proposition 1

Let $\mathcal{L}_{\mathrm{RLHF}}(\pi_\theta, \pi_{\mathrm{ref}}; \mathcal{S})$ be the KL-constrained reward maximization objective in Appendix Eq. \ref{apx:rlhf_obj} (Rafailov et al., 2024), where the dataset $\mathcal{S} = \{(Q, A, I_w, I_l)\}$ contains good images $I_w$ and corrupted images $I_l$. Let $\mathc
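The proposition statement is cut off here, so the exact Image-DPO objective is not visible in this excerpt. As a rough illustration only, the sketch below assumes the standard DPO log-sigmoid form (Rafailov et al.) with the preference placed on the image: the same question $Q$ and answer $A$ are scored once with the good image $I_w$ and once with the corrupted image $I_l$. The function name, argument names, and the default `beta` are hypothetical.

```python
import math


def image_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss for one (Q, A, I_w, I_l) sample (assumed form).

    logp_w     : log pi_theta(A | Q, I_w), policy score with the good image
    logp_l     : log pi_theta(A | Q, I_l), policy score with the corrupted image
    ref_logp_* : the same quantities under the frozen reference model pi_ref
    beta       : strength of the implicit KL constraint (assumed default)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy prefers the good image more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(image_dpo_loss(-0.5, -3.0, -1.0, -1.0))
```

Under this assumed form, the loss decreases only when the answer becomes relatively more likely given the good image than given the corrupted one, which matches the stated goal of making the model attend to the visual input.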

Figures (35)

  • Figure 1: Sample data from ViLP. For the same question, ViLP provides three distinct images and corresponding answers (upper-left corner). All questions follow a consistent structure, combining a distractor fact with a question. The Prior Answer (first column) can be directly inferred from the question, while the Test Answers (second and third columns) rely on visual cues. Our answers are designed to be single words, and both the model and human evaluators are tasked with open-domain answering rather than selecting from predefined options. To support this, we have developed a robust synonym and plural detection pipeline, ensuring that open-ended responses do not hinder the evaluation process. This approach also enables evaluation without relying on LLMs. Please refer to Appendix \ref{appen:more_examples} for more data samples from ViLP. We investigate the impact of image styles in Appendix \ref{appen:realistic}, where we generate more realistic images using GPT-4o image generation (https://openai.com/index/introducing-4o-image-generation/). Furthermore, we include both qualitative and quantitative comparison results with Winoground (Thrush et al., 2022), Whoops! (Bitton-Guetta et al., 2023), and HallusionBench (Guan et al., 2023) in Appendix \ref{appen:data_cmp}.
  • Figure 2: Qualitative examples. We show the results from GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.2-Vision-90B for some challenging cases. Please refer to Appendix \ref{appen:failure_cases} for more failure case analysis.
  • Figure 3: Comparison of benchmark scores under different image transformations. Solid and dotted lines refer to ViLP$^F$-Score and ViLP$^F$-Prior, respectively.
  • Figure 4: Qualitative results before and after removing distractor facts. GPT-4o and LLaVA-1.5-13B models yield completely opposite behaviors.
  • Figure 5: Randomly sampled data from ViLP.
  • ...and 30 more figures
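
The Figure 1 caption mentions a synonym and plural detection pipeline for scoring single-word, open-ended answers without an LLM judge. The sketch below shows one way such matching could work; it is not the authors' pipeline. The suffix rules in `singularize` are deliberately naive, and the synonym table passed to `is_correct` is a hypothetical hand-written mapping.

```python
import string


def singularize(word):
    """Naive English plural stripping (illustrative, not exhaustive)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # berries -> berry
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                # foxes -> fox
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # cats -> cat
    return word


def normalize(answer):
    """Lowercase, trim surrounding punctuation, and strip plural endings."""
    cleaned = answer.strip().lower().strip(string.punctuation)
    return singularize(cleaned)


def is_correct(prediction, gold, synonyms=None):
    """Check a one-word prediction against the gold answer and its synonyms.

    `synonyms` maps a gold answer to acceptable alternatives; it is a
    hypothetical lookup table supplied by the caller.
    """
    accepted = {normalize(gold)}
    for alternative in (synonyms or {}).get(gold, ()):
        accepted.add(normalize(alternative))
    return normalize(prediction) in accepted


if __name__ == "__main__":
    table = {"couch": ["sofa"]}  # hypothetical synonym entry
    print(is_correct("Foxes.", "fox"))
    print(is_correct("sofa", "couch", table))
    print(is_correct("dog", "cat"))
```

Rule-based matching of this kind is deterministic and cheap, which is what makes LLM-free evaluation possible; a production pipeline would likely need a proper lemmatizer and a curated synonym list.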

Theorems & Definitions (1)

  • Proposition 1