
Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui

TL;DR

DeepEval tackles the challenge of uncovering deep semantics in images by introducing a cartoon-focused benchmark with three MCQ-based subtasks that assess fine-grained description, in-depth title meaning, and deep semantic understanding. The study benchmarks nine open-source Large Multimodal Models and GPT-4V, revealing a substantial AI–human gap, with model size and the use of image descriptions positively affecting performance. Key findings show that surface descriptions can boost deep semantic comprehension and that larger parameter counts generally yield more stable and capable models, though deep semantics remain the most difficult task. This work provides a systematic framework and dataset to push toward truly semantically aware multimodal systems and highlights directions for expanding benchmarks beyond cartoons to diverse image types.

Abstract

Understanding the deep semantics of images is essential in an era dominated by social media. However, current research focuses primarily on the superficial description of images, revealing a notable lack of systematic investigation into their inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess the capacity of Large Multimodal Models (LMMs) to understand visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Using DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and those of humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies with the specific facets of deep semantics explored, indicating the fundamental challenges that remain in developing LMMs.
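
Because all three subtasks are posed as multiple-choice questions, a model's result on each can be summarized as plain accuracy. The following is a minimal sketch of that scoring step, not the authors' evaluation code: the record layout, the task labels, and the function name `accuracy_by_task` are assumptions made for illustration, and the toy records are invented and do not come from the dataset.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy} for multiple-choice predictions.

    `records` is an iterable of (task, predicted_option, gold_option)
    tuples. This layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        if predicted == gold:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Invented toy predictions, one tuple per question (not real benchmark data).
toy_records = [
    ("description", "A", "A"),
    ("description", "B", "B"),
    ("title", "C", "C"),
    ("title", "A", "D"),
    ("deep_semantics", "B", "A"),
    ("deep_semantics", "D", "D"),
    ("deep_semantics", "C", "A"),
    ("deep_semantics", "A", "B"),
]

scores = accuracy_by_task(toy_records)
for task, acc in scores.items():
    print(f"{task}: {acc:.2f}")
```

Reporting one accuracy per subtask, rather than a single pooled number, keeps the comparison the abstract draws visible: a model can match humans on description selection while still trailing them on deep semantics understanding.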

Paper Structure (36 sections, 9 figures, 7 tables)

Figures (9)

  • Figure 1: An example from the DeepEval dataset, including the annotated description, the annotated title, and the corresponding multiple-choice question for deep semantics from the Deep Semantics Understanding Task.
  • Figure 2: The distribution of the six categories in the DeepEval dataset.
  • Figure 3: Schematic diagram of the DeepEval dataset construction process, which has three stages: Image Collection & Human Annotation, Quality Control, and Distractor Generation.
  • Figure 4: Random samples of answers chosen by CogVLM and MiniGPT-4, along with the standard answers, covering three categories: Touching, Inspiring, and Humorous, with one sample from each category.
  • Figure 5: The radar charts represent the performance of several typical models in understanding images across different categories in our three tasks.
  • ...and 4 more figures