Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Yue Yang, Shuibai Zhang, Wenqi Shao, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo

TL;DR

This paper identifies critical flaws in static Vision-Language Model benchmarks, notably data contamination and fixed complexity, and proposes Vision-Language Bootstrapping (VLB) as a dynamic multimodal evaluation framework. VLB uses image and language bootstrapping to generate diverse, complexity-controlled VQA variants, guarded by a judge that preserves answer correctness, thereby reducing leakage and enabling evaluation that co-evolves with LVLM capabilities. Empirical results across SEEDBench, MMBench, MME, and other benchmarks show that dynamic variants reveal performance gaps, with robust findings on how visual attention and language understanding affect LVLMs under varied user-like conditions. The work provides a practical, generalizable approach to dynamically evaluating LVLMs and points to improvements in robustness and user-adaptivity, with potential applicability to other multimodal tasks beyond VQA.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchmarks are static and overlap with pre-training data, so their complexity is fixed and they suffer from data contamination. This raises concerns about the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment of LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while a judge module ensures that newly generated samples remain consistent with the original ones. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.
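To make the bootstrap-then-judge flow concrete, the following is a minimal sketch and not the authors' implementation. Every name in it (`VQASample`, `add_object`, `rephrase_question`, `corrupt_answer`, `answer_preserving_judge`, `bootstrap`) is hypothetical. The image is represented by a plain string, and the toy judge only checks that the ground-truth answer is unchanged; the abstract says only that a judge module keeps generated samples consistent with the originals, so the actual consistency check is an assumption here.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class VQASample:
    """Hypothetical container for one visual question-answering sample."""
    image: str      # stand-in for image data; a real pipeline would hold pixels
    question: str
    answer: str


Strategy = Callable[[VQASample], VQASample]
Judge = Callable[[VQASample, VQASample], bool]


def add_object(sample: VQASample) -> VQASample:
    # Hypothetical image strategy: mark the image as containing a new object.
    return replace(sample, image=sample.image + "+distractor")


def rephrase_question(sample: VQASample) -> VQASample:
    # Hypothetical language strategy: reword the question, keep its meaning.
    return replace(sample, question="Could you tell me: " + sample.question)


def corrupt_answer(sample: VQASample) -> VQASample:
    # Deliberately faulty strategy, included to show the judge rejecting a variant.
    return replace(sample, answer="unknown")


def answer_preserving_judge(original: VQASample, candidate: VQASample) -> bool:
    # Toy judge (assumption): accept only if the ground-truth answer is unchanged.
    return candidate.answer == original.answer


def bootstrap(sample: VQASample,
              strategies: Sequence[Strategy],
              judge: Judge) -> VQASample:
    """Compose strategies in order, keeping only variants the judge accepts."""
    current = sample
    for strategy in strategies:
        candidate = strategy(current)
        if judge(sample, candidate):
            current = candidate  # accepted: later strategies build on it
        # rejected: fall back to the last accepted variant
    return current


if __name__ == "__main__":
    original = VQASample("img", "What color is the car?", "red")
    variant = bootstrap(
        original,
        [add_object, corrupt_answer, rephrase_question],
        answer_preserving_judge,
    )
    print(variant)
```

Applying more strategies in sequence yields a harder variant of the same sample, which is one way to read "flexible complexity"; a rejected step is skipped rather than aborting the whole composition, which is a design choice of this sketch.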

Paper Structure

This paper contains 33 sections, 15 figures, 15 tables.

Figures (15)

  • Figure 1: (a) shows that some images in evaluation sets appear verbatim in the training set, and that their corresponding questions can be answered from the captions of similar training images. (b) compares our dynamic multimodal evaluation with previous static evaluation: dynamic evaluation can create multiple variants of static benchmarks with flexible complexity.
  • Figure 2: (a) Images in existing benchmarks overlap severely with pre-training data. (b) Questions about contaminated evaluation images can also be answered from the captions of similar training images.
  • Figure 3: Illustration of our proposed dynamic multimodal evaluation framework, Vision-Language Bootstrapping (VLB). (a) demonstrates how we derive insights from real user interactions with LVLMs, where users with diverse identities differ in visual attention and language understanding. (b) highlights the role of VLB's judge module in ensuring that generated images and questions remain consistent with the originals. (c) provides an example of VLB transforming a sample through image and language bootstrapping. Additionally, VLB can generate new, increasingly complex samples through bootstrapping composition.
  • Figure 4: Image bootstrapping strategies: starting from an original image, routes $\mathcal{V}_1$, $\mathcal{V}_2$, and $\mathcal{V}_3$ represent adding new objects, removing existing objects, and expanding the original image, respectively.
  • Figure 5: Results of composing image and language bootstrapping strategies on MMBench.
  • ...and 10 more figures