VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, Kyomin Jung

TL;DR

VLind-Bench is a pipelined benchmark for measuring language priors in Large Vision-Language Models (LVLMs). It decomposes counterfactual reasoning into four tests: commonsense knowledge, visual perception, commonsense bias, and language priors. The data-generation pipeline combines GPT-4-generated counterfactual contexts, DALL-E 3 image synthesis, and human/LLM validation to create a controlled evaluation environment that disentangles language priors from other deficits. Experiments show that most LVLMs rely heavily on language priors; only GPT-4o and RLHF-V-enabled models perform notably better, and reliance on language priors is inversely related to the scale of the backbone LLM. The work shows that a pipelined evaluation can diagnose grounding problems and guide targeted improvements, including RLHF-V approaches that promote the use of visual information when grounding responses. It offers a practical benchmark and diagnostic framework for pushing LVLMs toward more faithful visual reasoning.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.
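The gating described above (basic tests must pass before language priors are assessed, and, per the Figure 1 caption, both the true and the false statement of a stage must be judged correctly to proceed) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stage identifiers, the `Judge` callback signature, and the `conditional_lp_accuracy` aggregate are hypothetical names introduced here, and the paper's exact scoring formula is not reproduced.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stage order follows the four tests named in the text; the identifiers
# themselves are hypothetical labels chosen for this sketch.
STAGES = (
    "commonsense_knowledge",
    "visual_perception",
    "commonsense_bias",
    "language_prior",
)

Pair = Tuple[str, str]            # (true statement, false statement)
Instance = Dict[str, Pair]        # one statement pair per stage
# Assumed callback: given a stage and a statement, return the model's
# True/False verdict. How image and text are fed to the model is left out.
Judge = Callable[[str, str], bool]


def stage_passed(stage: str, pair: Pair, judge: Judge) -> bool:
    """A stage passes only if the true statement is accepted AND the false one rejected."""
    true_stmt, false_stmt = pair
    return judge(stage, true_stmt) and not judge(stage, false_stmt)


def stages_passed(instance: Instance, judge: Judge) -> int:
    """Count consecutive stages passed, stopping at the first failure."""
    count = 0
    for stage in STAGES:
        if not stage_passed(stage, instance[stage], judge):
            break
        count += 1
    return count


def conditional_lp_accuracy(instances: List[Instance], judge: Judge) -> Optional[float]:
    """Assumed aggregate: language-prior pass rate among instances that
    cleared all three preceding stages. Returns None if none are eligible."""
    counts = [stages_passed(inst, judge) for inst in instances]
    eligible = [c for c in counts if c >= len(STAGES) - 1]
    if not eligible:
        return None
    return sum(c == len(STAGES) for c in eligible) / len(eligible)
```

Conditioning the final rate on the earlier stages is what keeps a failure of commonsense knowledge or visual perception from being misread as reliance on language priors.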

Paper Structure

This paper contains 36 sections, 5 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: (a) An example from VLind-Bench. Our benchmark consists of four types of questions (i-iv). (b) Evaluation pipeline of VLind-Bench. In the pipeline, both true and false statements of the current stage must be correctly evaluated to proceed to the next stage.
  • Figure 2: Data samples for concept of climate, color, diet, folklore, and habitat.
  • Figure 3: Data samples for concept of history, landmark, location, size, time, and weight.
  • Figure :
  • Figure :
  • ...and 2 more figures