MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception

Yuhao Wang; Yusheng Liao; Heyang Liu; Hongcheng Liu; Yu Wang; Yanfeng Wang

TL;DR

The paper defines self-awareness in multimodal perception by extending the knowledge-quadrant framework to include visual inputs, and introduces MM-SAP, a VQA benchmark with three sub-datasets (BasicVisQA, KnowVisQA, BeyondVisQA) that evaluates what MLLMs know and do not know about images. An evaluation of 13 MLLMs reveals pronounced gaps between closed-source and open-source models in recognizing knowledge boundaries and handling unknowns, as well as varying refusal behaviours. Although some models excel at answering questions about known visual information and at refusing unknowns, overall self-awareness in perception remains limited, underscoring the need for strategies that mitigate hallucinations and improve reliability. With datasets and code publicly available, MM-SAP provides a practical framework for building more trustworthy, self-aware multimodal systems.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding. However, these models also suffer from hallucinations, which limit their reliability as AI systems. We believe that these hallucinations are partially due to the models' struggle with understanding what they can and cannot perceive from images, a capability we refer to as self-awareness in perception. Despite its importance, this aspect of MLLMs has been overlooked in prior studies. In this paper, we aim to define and evaluate the self-awareness of MLLMs in perception. To do this, we first introduce the knowledge quadrant in perception, which helps define what MLLMs know and do not know about images. Using this framework, we propose a novel benchmark, the Self-Awareness in Perception for MLLMs (MM-SAP), specifically designed to assess this capability. We apply MM-SAP to a variety of popular MLLMs, offering a comprehensive analysis of their self-awareness and providing detailed insights. The experimental results reveal that current MLLMs possess limited self-awareness capabilities, pointing to a crucial area for future advancement in the development of trustworthy MLLMs. Code and data are available at https://github.com/YHWmz/MM-SAP.

Paper Structure (24 sections, 8 equations, 9 figures, 7 tables)

Introduction
Related work
Self-awareness of LLMs
Hallucination on MLLMs
Benchmarks for MLLMs
Self-awareness in Perception
Knowledge Quadrant for MLLMs
MM-SAP Benchmark
BasicVisQA
KnowVisQA
BeyondVisQA
Experiments
Evaluation Strategy
Inference Settings
Main Results
...and 9 more sections

Figures (9)

Figure 1: Self-awareness of a trustworthy MLLM. A trustworthy MLLM is aware of what it knows and what it does not know. Top: For questions it knows, it provides correct answers as a reliable AI system. Bottom: It recognizes unknown questions and refuses to answer them, preventing the generation of incorrect responses.
Figure 2: Knowledge quadrants for LLMs and MLLMs. Taking the visual information into account, we expand the original quadrant horizontally to develop the knowledge quadrant for MLLMs.
Figure 3: Overview of MM-SAP. Our MM-SAP benchmark comprises three sub-datasets, namely BasicVisQA, KnowVisQA, and BeyondVisQA, and includes a total of 19 subtasks. The white dashed line indicates that the delineation between 'Knowns' and 'Unknowns' is model-specific. The number in square brackets in the middle ring represents the size of the subset, while the number in the outer ring indicates the proportion of each subtask within the subset.
Figure 4: Examples for each sub-dataset. In MM-SAP, all samples include a refusal option. In BeyondVisQA, the model can only choose the refusal option. In KnowVisQA, the model has the option to select either the correct answer or to correctly refuse to answer. In BasicVisQA, the model is restricted to choosing the correct option only.
Figure 5: Score distribution of MLLMs. The x-axis and y-axis represent $score_{kk}$ and $score_{ku}$, respectively. The dashed lines are isolines of $score_{sa}$.
...and 4 more figures
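
The answer rules in the Figure 4 caption can be read as a small grading procedure. The Python sketch below is illustrative only and is not the authors' evaluation code. The `Sample` class, the `REFUSAL` label, and the function names are hypothetical. The definitions used here are also assumptions, since this page names the symbols from Figure 5 without defining them: `score_kk` is taken as the fraction of questions answered correctly, `score_ku` as the fraction correctly refused, and `score_sa` as their sum.

```python
from dataclasses import dataclass
from typing import Optional

REFUSAL = "refuse"  # hypothetical label for the refusal option present in every sample


@dataclass
class Sample:
    subset: str            # "BasicVisQA", "KnowVisQA", or "BeyondVisQA"
    answer: Optional[str]  # correct option; None for BeyondVisQA (no answerable option)
    prediction: str        # option chosen by the model


def outcome(sample: Sample) -> str:
    """Grade one response as 'correct', 'refused', or 'wrong' (Figure 4 rules)."""
    if sample.subset == "BeyondVisQA":
        # Only the refusal option is acceptable.
        return "refused" if sample.prediction == REFUSAL else "wrong"
    if sample.prediction == sample.answer:
        return "correct"
    if sample.subset == "KnowVisQA" and sample.prediction == REFUSAL:
        # Refusing is also acceptable when the required knowledge may be unknown.
        return "refused"
    # BasicVisQA accepts only the correct option, so a refusal there is wrong.
    return "wrong"


def scores(samples: list) -> dict:
    """Aggregate scores under the assumed definitions stated above."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    kk = sum(outcome(s) == "correct" for s in samples) / n
    ku = sum(outcome(s) == "refused" for s in samples) / n
    return {"score_kk": kk, "score_ku": ku, "score_sa": kk + ku}
```

For example, `scores([Sample("BasicVisQA", "A", "A"), Sample("BeyondVisQA", None, "refuse")])` counts one correct answer and one correct refusal. Under the additive assumption, lines of constant `score_sa` on a plot of `score_kk` against `score_ku` would be straight diagonals.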
