An Examination of the Compositionality of Large Generative Vision-Language Models

Teli Ma; Rong Li; Junwei Liang

TL;DR

The paper tackles the limited understanding of compositional reasoning in Generative Vision-Language Models (GVLMs) and identifies a syntactic bias in current benchmarks that inflates VisualGPTScore-based assessments. It introduces SyntaxBias Score, a tool to quantify bias using strong LLMs, and constructs SADE, a de-biased benchmark that combines bias mitigation with a content-focused understanding challenge. Through evaluations of GVLMs such as LLaVA and InstructBLIP on SADE, the work reveals that existing benchmarks overemphasize linguistic priors and that SADE provides a more faithful measure of visio-linguistic compositionality, with InstructBLIP and Emu performing best on several SADE tasks. The proposed benchmark, along with the accompanying code and data, offers a robust framework for fair comparisons and will guide future research toward truly content-grounded multimodal reasoning.

Abstract

With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains under-explored. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs. The bias renders VisualGPTScore an insufficient metric for assessing GVLMs. To combat this, we first introduce a SyntaxBias Score, leveraging LLMs to quantify such bias for mitigation. A challenging new task is subsequently added to evaluate the robustness of GVLMs against inherent inclination toward syntactical correctness. Using the bias-mitigated datasets and the new task, we propose a novel benchmark, namely SyntActically DE-biased benchmark (SADE). Our study provides an unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction (Code and dataset are available at https://github.com/TeleeMa/SADE).
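As a rough illustration of the SyntaxBias Score idea, the sketch below follows the description given elsewhere in this paper's figure captions: the score is the difference between a text-only (LLM-based) generative score of the positive caption and that of the negative caption. The function names and the toy bigram scorer are assumptions made for this example; in practice the scorer would be an LLM likelihood, and the paper's exact normalization is not reproduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM: a tiny set of "fluent" bigrams.
FLUENT_BIGRAMS = {
    ("the", "horse"), ("horse", "eats"), ("eats", "the"), ("the", "grass"),
}


def toy_text_score(caption: str) -> float:
    """Fraction of adjacent word pairs found in FLUENT_BIGRAMS (toy fluency proxy)."""
    words = caption.lower().split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return sum(b in FLUENT_BIGRAMS for b in bigrams) / len(bigrams)


def syntax_bias_scores(
    pairs: List[Tuple[str, str]],
    text_score: Callable[[str], float],
) -> List[float]:
    """Score(positive) - Score(negative) for each (positive, negative) caption pair.

    No image is involved: a positive value means the text-only scorer
    already prefers the positive caption.
    """
    return [text_score(pos) - text_score(neg) for pos, neg in pairs]


def positive_fraction(scores: List[float]) -> float:
    """Share of pairs whose score is strictly positive."""
    return sum(s > 0 for s in scores) / len(scores)


pairs = [("the horse eats the grass", "the grass eats the horse")]
scores = syntax_bias_scores(pairs, toy_text_score)
print(scores, positive_fraction(scores))
```

A benchmark whose scores cluster well above zero can be solved in part from text alone, which is the kind of bias the paper sets out to mitigate.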

Paper Structure (29 sections, 8 equations, 8 figures, 4 tables)

Introduction
Background
Generative vision-language models
Vision-language compositionality
Evaluation metrics for multimodal retrieval
Experimental setup
Model choices
Datasets
Evaluation Metric Examination
Sensitivity to bags-of-words
Sensitivity to syntax and contents
Benchmarks Examination
Syntactical bias in current benchmarks
SyntaxBias Score
Mitigate the Bias in Benchmarks
...and 14 more sections

Figures (8)

Figure 1: Box plots of scaled score distributions for original (x1) and perturbed captions (x2-x5; x2: shuffle nouns & adjectives, x3: shuffle all but nouns & adjectives, x4: shuffle within trigrams, x5: shuffle trigrams). The distribution gap between the original captions and the shuffled captions is evident for the generative scores, while the contrastive score (BERTScore) is significantly less affected by the order of words. The CLIPScore sub-figure illustrates the distribution of similarity scores generated by the CLIP model, which is compared with the first three sub-figures of LLaVA-7B.
Figure 2: An example of the three cases of captions we construct to validate the preference between syntax and contents. Right caption: the original caption of the image. Shuffled caption: a caption whose sentence elements are shuffled. Random caption: a fluent and syntactically correct caption from another dataset (COCO). Content caption: a caption that keeps only adjectives and nouns, preserving contents such as objects and attributes. We present the normalized VisualGPTScore of each reference sentence in this example. The scores of the Right caption and Content caption may be lower than that of the Random caption (0.405, 0.322 vs. 0.432). This indicates that, in this example, generative VLMs tend to prioritize syntactically correct sentences over ones that are more relevant to the content.
Figure 3: We report the accuracy of VisualGPTScore based on LLaVA-7B and of the similarity score based on CLIP on the 507 sampled image-text pairs. Each pair consists of three cases, like the example in Fig. 2.
Figure 4: The drop in performance of the LLaVA model when performing compositional reasoning on nonsensical noisy images is minimal in existing benchmarks, whereas the CLIP model exhibits a significant decrease. This indicates that current benchmarks are exploited by the LLM part of GVLMs and are not effective in measuring multimodal compositionality.
Figure 5: We visualize the distribution of the SyntaxBias Score in current benchmarks. The SyntaxBias Score is defined as the difference between the LLM-based generative scores of positive and negative references. For ARO, VL-CheckList and CREPE, the distribution of the SyntaxBias Scores is situated towards the positive end (to the right of the red line), implying that these benchmarks are syntactically biased toward positive captions.
...and 3 more figures
