GenCeption: Evaluate Vision LLMs with Unlabeled Unimodal Data

Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang

TL;DR

GenCeption introduces an annotation-free, unimodal evaluation framework for Multimodal LLMs, enabling measurement of inter-modality semantic coherence and hallucination tendency without relying on costly multimodal annotations. It formalizes the GC@$T$ metric to track semantic drift across iterative description–generation cycles and builds the MMECeption benchmark from MME images to evaluate Vision LLMs against established benchmarks and humans. Empirical results show strong correlations with existing benchmarks, confirm the robustness of GC@$T$ across image generators, and reveal that current VLLMs lag behind human performance, especially on text-intensive tasks. The approach is modality-agnostic and scalable, offering a complementary evaluation tool with potential extensions to other data modalities and more granular skill analyses.

Abstract

Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper outlines and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and inversely assesses MLLMs' tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training data contamination, is expected to result in slower benchmark saturation, and avoids the illusion of emerging abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps. The semantic drift across iterations is quantified using the GC@$T$ metric. While GenCeption is applicable in principle to MLLMs across various modalities, this paper focuses on its implementation and validation for Vision LLMs (VLLMs). Based on the GenCeption method, we establish the MMECeption benchmark for evaluating VLLMs, and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still significantly lag behind human performance and struggle especially with text-intensive tasks.
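The iterative description and generation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `describe`, `generate`, and `embed` are hypothetical callables standing in for the VLLM, the image generator, and an image encoder, and the use of cosine similarity between embeddings is an assumption made for this sketch.

```python
import math
from typing import Any, Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def genception_similarities(
    seed: Any,
    describe: Callable[[Any], str],            # hypothetical: VLLM that describes a sample
    generate: Callable[[str], Any],            # hypothetical: generator that rebuilds a sample
    embed: Callable[[Any], Sequence[float]],   # hypothetical: encoder used for comparison
    num_iterations: int,
) -> List[float]:
    """Run describe -> generate cycles and return the similarity to the seed per iteration."""
    seed_embedding = embed(seed)
    current = seed
    similarities = []
    for _ in range(num_iterations):
        description = describe(current)    # describe the previous sample
        current = generate(description)    # regenerate a sample from the description
        similarities.append(cosine_similarity(seed_embedding, embed(current)))
    return similarities
```

With a lossless `describe`/`generate` pair the similarities stay at 1 across iterations; a model that omits or hallucinates details would be expected to produce a declining sequence, which is the semantic drift the metric is meant to capture.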

Paper Structure (13 sections, 4 equations, 7 figures, 4 tables, 1 algorithm)

Figures (7)

  • Figure 1: An illustration of the $t$-th iteration in the GenCeption evaluation procedure for VLLMs. Using the image modality as an example, the process begins with an existing image $\mathbf{X}^{(0)}$ sourced from a unimodal image dataset for the first iteration ($t=1$). The VLLM provides a detailed description of the image, which is then used by an image generator to produce $\mathbf{X}^{(t)}$.
  • Figure 2: Evaluation results of GC@$3$, MME, HallusionBench and OpenCompass on visual-intensive (Vis) and text-intensive (Text) images. Best results per metric and category (over different MLLMs) are bolded.
  • Figure 3: Correlation Matrix of GC@$1$ and GC@$3$ scores on MMECeption, and several other benchmarks.
  • Figure 4: Demonstration of GenCeption evaluation procedure: the images generated over 3 GenCeption iterations for several MLLMs. The similarity $s^{(t)}$ scores (to the seed image) are shown on the top of images; GC@$1$ and GC@$3$ scores are printed on the bottom of the first and third image, respectively.
  • Figure 5: Example seed images from the visually and textually intensive groups, along with their associated metadata.
  • ...and 2 more figures
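
The captions above refer to per-iteration similarity scores $s^{(t)}$ (to the seed image) and to aggregate GC@$1$ and GC@$3$ scores. This excerpt does not reproduce the metric's definition, so the sketch below assumes one plausible form: a weighted average of $s^{(1)}, \dots, s^{(T)}$ with weight $t$ on the $t$-th iteration, so that later iterations count more. The function name `gc_at_t` and this weighting are assumptions for illustration; the paper's equations are authoritative.

```python
from typing import Sequence


def gc_at_t(similarities: Sequence[float], T: int) -> float:
    """Aggregate the first T per-iteration similarities into one score.

    ASSUMPTION: iteration t is weighted by t (weights 1, 2, ..., T), i.e.
    sum(t * s_t) / sum(t). Check the paper's definition before relying on it.
    """
    if T < 1 or T > len(similarities):
        raise ValueError("T must be between 1 and the number of similarity scores")
    weighted_sum = sum(t * s for t, s in enumerate(similarities[:T], start=1))
    weight_total = T * (T + 1) / 2  # closed form of 1 + 2 + ... + T
    return weighted_sum / weight_total
```

Under this assumed weighting, GC@$1$ reduces to the first similarity score, while GC@$3$ gives the third iteration three times the weight of the first.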