Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

Zirui Shao; Feiyu Gao; Zhaoqing Zhu; Chuwei Luo; Hangdi Xing; Zhi Yu; Qi Zheng; Ming Yan; Jiajun Bu

TL;DR

This work identifies Cognition and Perception (C&P) knowledge conflicts in multimodal document understanding: state-of-the-art MLLMs often give answers (cognition) that do not match the content their own OCR reads (perception). It formalizes C&P consistency, constructs evaluation samples across five benchmarks, and finds substantial conflicts even in leading models. The authors propose Multimodal Knowledge Consistency Fine-tuning, which comprises C&P Link Tokens and a C&P Connector, to strengthen the link between perception and cognition. Across three open-source MLLMs and five datasets, the method reduces C&P conflicts and improves performance on both cognitive and perceptual tasks, which points to better integration and explainability. The work also provides ablations, error analyses, and case studies that examine the sources of residual conflicts and the benefits of cross-verification between perceptual and cognitive knowledge.

Abstract

Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands". Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflict, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 75.26% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. Our method reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks.
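To make the idea of a C&P consistency rate concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each sample yields one cognitive answer (from a VQA-style query) and one perceptual answer (from an OCR-style query on the same region), and it assumes that two answers count as consistent when they match exactly after lowercasing and whitespace normalization. The function names `normalize` and `cp_consistency`, the matching rule, and the sample data are all hypothetical; the matching criterion actually used in the paper is not stated in this section.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def cp_consistency(pairs: list[tuple[str, str]]) -> float:
    """Return the fraction of (cognitive, perceptual) answer pairs that agree.

    Assumption: agreement means exact string equality after normalization.
    """
    if not pairs:
        raise ValueError("need at least one (cognitive, perceptual) pair")
    consistent = sum(normalize(c) == normalize(p) for c, p in pairs)
    return consistent / len(pairs)


# Hypothetical samples: (answer from the VQA query, text from the OCR query).
samples = [
    ("Total: $42.00", "total:  $42.00"),  # agrees once case and spacing are normalized
    ("12 March 2021", "12 March 2027"),   # conflict: the answer differs from the OCR reading
    ("ACME Corp", "ACME Corp"),
    ("Invoice 0017", "Invoice 0017"),
]

rate = cp_consistency(samples)
print(f"C&P consistency: {rate:.2%}")
```

Under this assumed rule, a figure such as the 75.26% reported for GPT-4o would mean that roughly one in four evaluated pairs disagreed. A looser rule (for example, edit-distance tolerance) would produce a different rate on the same outputs, so the choice of matching criterion matters when comparing numbers.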
