Probing Visual Language Priors in VLMs

Tiange Luo; Ang Cao; Gunhee Lee; Justin Johnson; Honglak Lee

TL;DR

The paper introduces ViLP, a benchmark that uses out-of-distribution images and distractor-informed questions to probe whether Vision-Language Models (VLMs) depend on visual language priors rather than true visual reasoning. It demonstrates a substantial gap between human performance and current VLMs on ViLP, which motivates a self-improvement approach called Image-DPO that generates VQA data and then corrupts the images so that training emphasizes visual inputs. The Image-DPO objective is shown to upper-bound the RLHF objective, and experiments on open-source models such as LLaVA-v1.5 and Cambrian show consistent performance gains. The work provides a scalable data-generation pipeline, a theoretical link to learning from feedback, and released resources to support future research on improving visual reasoning in VLMs.

Abstract

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors present in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
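
To make the "good-bad" image pair idea concrete, here is a minimal sketch of the pixel-level half of that step. It is an illustration under stated assumptions, not the authors' pipeline: the specific corruptions (Gaussian noise plus a masked square patch), the function names, and the dictionary keys `image_w` / `image_l` are all hypothetical, and semantic corruptions (which would need an image editing or generation model) are not covered.

```python
import numpy as np

def gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to a float image with values in [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def mask_patch(image, top, left, size):
    """Zero out a square patch, a crude stand-in for hiding image content."""
    corrupted = image.copy()
    corrupted[top:top + size, left:left + size] = 0.0
    return corrupted

def make_pair(image, question, answer, rng, sigma=0.3, patch=8):
    """Pair the clean image (image_w) with a corrupted copy (image_l).

    The question and answer are shared, so only the image differs
    between the preferred and the rejected sample.
    """
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    bad = mask_patch(gaussian_noise(image, sigma, rng), top, left, patch)
    return {"question": question, "answer": answer,
            "image_w": image, "image_l": bad}

# Hypothetical usage with a random array standing in for a generated image.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
pair = make_pair(image, "What animal is in the picture?", "cat", rng)
```

Holding the question and answer fixed while varying only the image is what lets a preference-style objective attribute the difference in answer likelihood to the visual input.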

Paper Structure (30 sections, 1 theorem, 19 equations, 35 figures, 5 tables)

Introduction
Related Work
ViLP Benchmark
Design Principles
Question-Image-Answer Generation
Dataset Evaluation
Image DPO
Objective
Data Generation
Experiments
Image DPO
Conclusion
Acknowledgment
More details and comparisons of our benchmarks
More data samples of ViLP
...and 15 more sections

Key Result

Proposition 1

Let $\mathcal{L}_{\mathrm{RLHF}}(\pi_\theta, \pi_{\mathrm{ref}}; \mathcal{S})$ be the KL-constrained reward maximization objective given in the Appendix (Rafailov et al., 2024), where the dataset $\mathcal{S} = \{(Q, A, I_w, I_l)\}$ contains good images $I_w$ and corrupted images $I_l$. Let $\mathc
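
For reference, the proposition statement is cut off in this summary, so the block below does not attempt to complete it. It only recalls the standard KL-constrained reward maximization objective and the DPO loss from Rafailov et al., followed by one plausible image-conditioned analogue written in the notation above. The analogue is an assumption for illustration (including the choice to keep $Q$ and $A$ fixed and vary only the image); the symbols $r$, $\beta$, $\sigma$, $x$, $y_w$, and $y_l$ follow the usual DPO conventions rather than this page.

```latex
% Standard KL-constrained reward maximization (RLHF) objective
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\bigl[ r(x, y) \bigr]
- \beta\, \mathbb{D}_{\mathrm{KL}}\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]

% Standard DPO loss over preferred / rejected responses (y_w, y_l)
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x, y_w, y_l)}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]

% Assumed image-conditioned analogue: same (Q, A), preferred image I_w vs. corrupted image I_l
-\,\mathbb{E}_{(Q, A, I_w, I_l) \sim \mathcal{S}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(A \mid Q, I_w)}{\pi_{\mathrm{ref}}(A \mid Q, I_w)}
- \beta \log \frac{\pi_\theta(A \mid Q, I_l)}{\pi_{\mathrm{ref}}(A \mid Q, I_l)}
\right) \right]
```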

Figures (35)

Figure 1: Sample data from ViLP. For the same question, ViLP provides three distinct images and corresponding answers (upper-left corner). All questions follow a consistent structure, combining a distractor fact with a question. The Prior Answer (first column) can be directly inferred from the question, while Test Answers (second and third columns) rely on visual cues. Answers are designed to be single words, and both models and human evaluators answer in an open-domain setting rather than selecting from predefined options. To support this, we developed a robust synonym and plural detection pipeline so that open-ended responses do not hinder evaluation; this also enables evaluation without relying on LLMs. More data samples from ViLP are provided in the Appendix. We investigate the impact of image styles in the Appendix, where we generate more realistic images using 4o image generation (https://openai.com/index/introducing-4o-image-generation/). Qualitative and quantitative comparisons with Winoground (Thrush et al., 2022), Whoops! (Bitton-Guetta et al., 2023), and HallusionBench (Guan et al., 2023) are also included in the Appendix.
Figure 2: Qualitative examples. We show the results from GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.2-Vision-90B for some challenging cases. Please refer to the Appendix for more failure case analysis.
Figure 3: Comparison of benchmark scores under different image transformations. Solid and dotted lines refer to ViLP$^F$-Score and ViLP$^F$-Prior, respectively.
Figure 4: Qualitative results before and after removing distractor facts. GPT-4o and LLaVA-1.5-13B exhibit completely opposite behaviors.
Figure 5: Randomly sampled data from ViLP.
...and 30 more figures
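
The Figure 1 caption mentions scoring open-ended, single-word answers with synonym and plural detection instead of an LLM judge. The sketch below shows one way such matching could work. It is a simplified illustration, not the authors' pipeline: the plural-stripping rules are deliberately rough, and the `SYNONYMS` table and function names are hypothetical.

```python
def singularize(word):
    """Very rough English plural stripping; a real pipeline would use a lemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # berries -> berry
    if word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]                # couches -> couch, boxes -> box
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # sofas -> sofa
    return word

def normalize(text):
    """Lowercase, trim whitespace and edge punctuation, then strip plurals."""
    return singularize(text.strip().lower().strip(".,!?\"'"))

def is_correct(prediction, gold, synonyms):
    """Accept the gold answer or any listed synonym, ignoring case and plural form."""
    accepted = {normalize(gold)}
    accepted |= {normalize(s) for s in synonyms.get(gold, [])}
    return normalize(prediction) in accepted

# Hypothetical synonym table keyed by the gold answer.
SYNONYMS = {"sofa": ["couch", "settee"]}

matches_synonym = is_correct("Couches.", "sofa", SYNONYMS)
matches_plural = is_correct("Sofas", "sofa", SYNONYMS)
rejects_other = is_correct("chair", "sofa", SYNONYMS)
```

Because both the prediction and the accepted answers pass through the same `normalize` function, imperfect plural stripping still compares like with like; the trade-off is that unrelated words sharing a stem could collide, which a dictionary-based lemmatizer would avoid.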

Theorems & Definitions (1)

Proposition 1
