MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception
Yuhao Wang, Yusheng Liao, Heyang Liu, Hongcheng Liu, Yu Wang, Yanfeng Wang
TL;DR
The paper defines self-awareness in multimodal perception by extending the knowledge-quadrant framework to include visual inputs and introduces MM-SAP, a VQA benchmark with three sub-datasets (BasicVisQA, KnowVisQA, BeyondVisQA), to evaluate what MLLMs know and do not know about images. It presents a rigorous evaluation of 13 MLLMs, revealing pronounced gaps between closed-source and open-source models in recognizing knowledge boundaries and handling unknowns, as well as varying refusal behaviours. The study shows that while some models excel at answering questions about known visual information and at refusing unknowns, overall self-awareness in perception remains limited, underscoring the need for strategies to mitigate hallucinations and improve reliability. By making its datasets and code available, MM-SAP provides a practical framework for building more trustworthy, self-aware multimodal systems.
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding. However, these models also suffer from hallucinations, which limit their reliability as AI systems. We believe that these hallucinations are partially due to the models' struggle to understand what they can and cannot perceive from images, a capability we refer to as self-awareness in perception. Despite its importance, this aspect of MLLMs has been overlooked in prior studies. In this paper, we aim to define and evaluate the self-awareness of MLLMs in perception. To do this, we first introduce the knowledge quadrant in perception, which helps define what MLLMs know and do not know about images. Using this framework, we propose a novel benchmark, the Self-Awareness in Perception for MLLMs (MM-SAP), specifically designed to assess this capability. We apply MM-SAP to a variety of popular MLLMs, offering a comprehensive analysis of their self-awareness and providing detailed insights. The experimental results reveal that current MLLMs possess limited self-awareness capabilities, pointing to a crucial area for future advancement in the development of trustworthy MLLMs. Code and data are available at https://github.com/YHWmz/MM-SAP.
