GenCeption: Evaluate Vision LLMs with Unlabeled Unimodal Data
Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang
TL;DR
GenCeption is an annotation-free evaluation framework for Multimodal LLMs that needs only unimodal data. It measures inter-modality semantic coherence and hallucination tendency without relying on costly multimodal annotations. It formalizes the GC@T metric to track semantic drift across iterative description–generation cycles, and builds the MMECeption benchmark from MME images to compare Vision LLMs (VLLMs) with one another and with human annotators. Empirical results show strong correlations with established benchmarks, indicate that GC@T is robust to the choice of image generator, and reveal that current VLLMs lag behind human performance, especially on text-intensive tasks. The approach is modality-agnostic and scalable, offering a complementary evaluation tool with potential extensions to other data modalities and to more granular skill analyses.
Abstract
Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper outlines and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and, inversely, MLLMs' tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training data contamination, is expected to result in slower benchmark saturation, and avoids the illusion of emerging abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps. The semantic drift across iterations is quantified using the GC@T metric. While GenCeption is in principle applicable to MLLMs across various modalities, this paper focuses on its implementation and validation for Vision LLMs (VLLMs). Based on the GenCeption method, we establish the MMECeption benchmark for evaluating VLLMs, and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still significantly lag behind human performance and struggle especially with text-intensive tasks.
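The iterative procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `describe`, `generate`, and `embed` are hypothetical placeholders for a VLLM, an image generator, and an image encoder, and the abstract does not define how GC@T aggregates the per-iteration similarities. The sketch assumes an iteration-weighted mean of cosine similarities to the seed sample, so that later iterations count more.

```python
import math
from typing import Callable, List, Sequence, TypeVar

Sample = TypeVar("Sample")  # a non-textual sample, e.g. an image


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_chain(
    seed: Sample,
    describe: Callable[[Sample], str],          # hypothetical: MLLM describes a sample
    generate: Callable[[str], Sample],          # hypothetical: generator recreates it from text
    embed: Callable[[Sample], Sequence[float]], # hypothetical: encoder used to compare samples
    num_iterations: int,
) -> List[float]:
    """Run describe -> generate cycles; return similarity to the seed after each cycle."""
    seed_embedding = embed(seed)
    current = seed
    similarities: List[float] = []
    for _ in range(num_iterations):
        description = describe(current)   # sample -> text
        current = generate(description)   # text -> new sample
        similarities.append(cosine_similarity(seed_embedding, embed(current)))
    return similarities


def drift_score(similarities: Sequence[float]) -> float:
    """Assumed aggregation: mean of similarities weighted by iteration index (1..T)."""
    if not similarities:
        raise ValueError("need at least one iteration")
    weights = range(1, len(similarities) + 1)
    return sum(t * s for t, s in zip(weights, similarities)) / sum(weights)
```

Under this assumed aggregation, a model that preserves the seed's content across all cycles scores close to 1, while a model whose descriptions drift or hallucinate yields lower similarities in later iterations and therefore a lower score.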
