Unveiling the Tapestry of Consistency in Large Vision-Language Models

Yuan Zhang, Fei Xiao, Tao Huang, Chun-Kai Fan, Hongyuan Dong, Jiawen Li, Jiacong Wang, Kuan Cheng, Shanghang Zhang, Haoyuan Guo

TL;DR

This paper introduces ConBench, a multi-modal benchmark for analyzing how LVLMs perform when prompts with solution spaces of different sizes target the same knowledge point. It finds that the larger the solution space of the prompt, the lower the accuracy of the answers.

Abstract

Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting strong perception and reasoning abilities on visual information. However, when faced with prompts whose solution spaces differ in size, LVLMs do not always give consistent answers about the same knowledge point. This inconsistency of answers across solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark, ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on ConBench, we are the first to reveal the tapestry and report the following findings: (1) In the discriminative realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) Linking the discriminative and generative realms, the accuracy of a discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of Consistency. Finally, we improve the consistency of LVLMs through trigger-based diagnostic refinement, which indirectly improves the quality of their captions. We hope this paper will help the research community better evaluate their models and encourage future advancements in the consistency domain. The project is available at https://github.com/foundation-multimodal-models/ConBench.

Paper Structure (24 sections, 3 equations, 10 figures, 4 tables)

This paper contains 24 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of our paper. Part (a) shows two examples of Inconsistency between discriminative answers and generative captions, where the answers marked in blue contradict the answers marked in purple. Part (b) shows the Consistency evaluation method ConBench and the top three entries of its discriminative leaderboard. Part (c) presents the three main findings built upon ConBench.
  • Figure 2: Overview of the 19 detailed evaluation categories in ConBench.
  • Figure 3: The prompt for generating discriminative questions. Please zoom in to view.
  • Figure 4: The pipeline for judging Consistency between the caption and discriminative answers via GPT/GPT4. Please zoom in to view the prompt.
  • Figure 5: The confidence and logits of the answers of LLaVA-13B and MGM-13B.
  • ...and 5 more figures