II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, Shiwen Ni

TL;DR

II-Bench introduces the first Image Implication Understanding Benchmark to probe higher-order perception in multimodal LLMs. It provides 1,222 images across six domains with 1,434 questions designed to test metaphorical and implicit content, annotated via crowdsourcing. The study shows a substantial gap between human and model performance, with closed-source models generally performing better than open-source ones, and reveals that emotion-polarity prompts significantly boost accuracy while chain-of-thought prompts do not. The results emphasize domain- and emotion-dependent weaknesses (notably in Art and Psychology) and suggest that incorporating emotional cues is a promising direction for advancing multimodal reasoning toward more robust AGI capabilities.

Abstract

Rapid advances in multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to assess the capabilities of MLLMs more accurately. However, the higher-order perceptual capabilities of MLLMs remain largely unexplored. To fill this gap, we propose the Image Implication Understanding Benchmark, II-Bench, which aims to evaluate a model's higher-order perception of images. Extensive experiments on II-Bench across multiple MLLMs yield three main findings. First, there is a substantial gap between the performance of MLLMs and humans on II-Bench: the best MLLM reaches 74.8% accuracy, whereas human accuracy averages 90% and peaks at 98%. Second, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, most models become more accurate when hints about the image's sentiment polarity are incorporated into the prompts, which points to a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
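The sentiment-polarity hint mentioned in the abstract can be pictured with a small sketch. The function name `build_prompt`, the prompt wording, and the three polarity labels below are all hypothetical, chosen only for illustration; the abstract does not give the authors' actual prompt template or label set.

```python
from string import ascii_uppercase
from typing import Optional, Sequence

# Assumed label set; the abstract does not list the polarity categories.
ALLOWED_POLARITIES = ("positive", "neutral", "negative")


def build_prompt(
    question: str,
    options: Sequence[str],
    polarity: Optional[str] = None,
) -> str:
    """Format a multiple-choice question, optionally prefixed by a sentiment hint.

    The wording is illustrative and is not the template used in the paper.
    """
    if not options or len(options) > len(ascii_uppercase):
        raise ValueError("expected between 1 and 26 answer options")

    lines = []
    if polarity is not None:
        if polarity not in ALLOWED_POLARITIES:
            raise ValueError(f"unknown polarity: {polarity!r}")
        # The hint is the only difference between the two prompt variants.
        lines.append(f"Hint: the overall sentiment of the image is {polarity}.")

    lines.append(f"Question: {question}")
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "What does the image imply?",
        ["Overwork harms health", "Coffee is tasty"],
        polarity="negative",
    ))
```

Keeping the hinted and unhinted prompts identical apart from the first line means that, under these assumptions, any accuracy difference between the two runs can be attributed to the hint alone.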

Paper Structure

This paper contains 40 sections, 82 figures, and 11 tables.

Figures (82)

  • Figure 1: Implication: a significant gap exists between humans and MLLMs on II-Bench.
  • Figure 2: Composition of II-Bench.
  • Figure 3: II-Bench examples sampled from each domain. The pictures include life, art, society, psychology, environment and other domains. Understanding these images and completing the corresponding questions require a certain level of comprehension.
  • Figure 4: GPT-4V error response distribution.
  • Figure 5: II-Bench specific image type and domain statistics.
  • ...and 77 more figures