Probing Visual Language Priors in VLMs
Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee
TL;DR
The paper introduces ViLP, a benchmark that uses out-of-distribution images and distractor-informed questions to probe whether Vision-Language Models depend on visual language priors rather than true visual reasoning. It demonstrates a substantial gap between human performance and current VLMs on ViLP, motivating a self-improvement approach called Image-DPO that generates and corrupts VQA data to emphasize visual inputs. The Image-DPO objective is shown to upper-bound the RLHF objective, and experiments on open-source models such as LLaVA-v1.5 and Cambrian show consistent performance gains. The work contributes a scalable data-generation pipeline, a theoretical link to learning from feedback, and released resources to support future research on improving visual reasoning in VLMs.
Abstract
Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
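The "good-bad" image pairing described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a tiny grayscale image stored as nested lists, uses Gaussian noise as a stand-in for the pixel-level corruption, and omits semantic corruptions entirely. The names `corrupt_pixels`, `make_preference_pair`, and `noise_std`, as well as the sample question and answer, are hypothetical.

```python
import random


def corrupt_pixels(image, noise_std=40.0, seed=0):
    """Return a noisy copy of `image` (rows of 0-255 grayscale ints).

    Gaussian noise is an assumed stand-in for a pixel-level corruption.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    corrupted = []
    for row in image:
        new_row = []
        for value in row:
            noisy = value + rng.gauss(0.0, noise_std)
            # Clamp back into the valid pixel range.
            new_row.append(min(255, max(0, round(noisy))))
        corrupted.append(new_row)
    return corrupted


def make_preference_pair(image, question, answer, noise_std=40.0, seed=0):
    """Bundle one Q&A with a clean ("good") and corrupted ("bad") image."""
    return {
        "question": question,
        "answer": answer,
        "good_image": image,
        "bad_image": corrupt_pixels(image, noise_std, seed),
    }


# Hypothetical toy example: a 2x3 grayscale "image" and a placeholder Q&A.
image = [[0, 64, 128], [192, 255, 32]]
pair = make_preference_pair(image, "placeholder question", "placeholder answer")
```

In this sketch the question and answer are held fixed while only the image changes, so any preference between the two entries has to come from the visual input rather than from the text.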
