InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models
Xiaofei Yin, Yijie Hong, Ya Guo, Yi Tu, Weiqiang Wang, Gongshen Liu, Huijia zhu
TL;DR
InsightVision presents a Chinese, multi-level benchmark for implicit visual semantics in LVLMs, addressing the gap where current benchmarks emphasize surface tasks. It introduces a semi-automatic dataset construction pipeline, yielding 2,500 image-question samples (16,220 QA items) across surface-level, symbolic meaning, background knowledge, and implicit meaning tasks. Evaluations of 15 open-source LVLMs plus GPT-4o show large performance gaps on implicit meaning, with the best models reaching around 60% accuracy versus humans around 74%, highlighting the challenge of deeper semantic understanding. The work demonstrates that incorporating multi-level image descriptions can markedly improve implicit meaning comprehension and suggests scaling alone is insufficient, urging architectural and training innovations. The dataset and code are to be released to foster further research in nuanced visual semantics and multimodal reasoning.
Abstract
In the evolving landscape of multimodal language models, understanding the nuanced meanings conveyed through visual cues - such as satire, insult, or critique - remains a significant challenge. Existing evaluation benchmarks primarily focus on direct tasks like image captioning or are limited to a narrow set of categories, such as humor or satire, for deep semantic understanding. To address this gap, we introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark designed specifically for evaluating the understanding of implicit meanings in images. This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension. We propose an innovative semi-automatic method for constructing datasets, adhering to established construction protocols. Using this benchmark, we evaluate 15 open-source large vision language models (LVLMs) and GPT-4o, revealing that even the best-performing model lags behind human performance by nearly 14% in understanding implicit meaning. Our findings underscore the intrinsic challenges current LVLMs face in grasping nuanced visual semantics, highlighting significant opportunities for future research and development in this domain. We will publicly release our InsightVision dataset, code upon acceptance of the paper.
